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Abstract 

Background: Annotating mammalian genomes for noncoding RNAs (ncRNAs) is nontrivial since far from all ncRNAs 
are known and the computational models are resource demanding. Currently, the human genome holds the best 
mammalian ncRNA annotation, a result of numerous efforts by several groups. However, a more direct strategy is 
desired for the increasing number of sequenced mammalian genomes of which some, such as the pig, are relevant as 
disease models and production animals. 

Results: We present a comprehensive annotation of structured RNAs in the pig genome. Combining sequence and 
structure similarity search as well as class specific methods, we obtained a conservative set with a total of 3,391 
structured RNA loci of which 1,01 1 and 2,314, respectively, hold strong sequence and structure similarity to structured 
RNAs in existing databases. The RNA loci cover 1 39 cis-regulatory element loci, 58 IncRNA loci, 1 1 conflicts of 
annotation, and 3,1 83 ncRNA genes. The ncRNA genes comprise 359 miRNAs, 8 ribozymes, 1 85 rRNAs, 638 snoRNAs, 
1,030 snRNAs, 810 tRNAs and 153 ncRNA genes not belonging to the here fore mentioned classes. When running the 
pipeline on a local shuffled version of the genome, we obtained no matches at the highest confidence level. 
Additional analysis of RNA-seq data from a pooled library from 1 0 different pig tissues added another 1 65 miRNA loci, 
yielding an overall annotation of 3,556 structured RNA loci. This annotation represents our best effort at making an 
automated annotation. To further enhance the reliability, 571 of the 3,556 structured RNAs were manually curated by 
methods depending on the RNA class while 1 ,581 were declared as pseudogenes. We further created a multiple 
alignment of pig against 20 representative vertebrates, from which RNAz predicted 83,859 denovo RNA loci with 
conserved RNA structures. 528 of the RNAz predictions overlapped with the homology based annotation or novel 
miRNAs. We further present a substantial synteny analysis which includes 1 ,004 lineage specific de novo RNA loci and 4 
ncRNA loci in the known annotation specific for Laurasiatheria (pig, cow, dolphin, horse, cat, dog, hedgehog). 

Conclusions: We have obtained one of the most comprehensive annotations for structured ncRNAs of a mammalian 
genome, which is likely to play central roles in both health modelling and production. The core annotation is available 
in EnsembI 70 and the complete annotation is available at http://rth.dk/resources/rnannotator/susscrl02/versionl.02. 
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Background 

With the sequencing of the human genome it became 
evident that protein coding sequences only make up 
about '^1.2% of a mammalian genome [1]. It has since 
then been a main challenge to analyze the remaining part 
of the genome. A non-negligible fraction of the genome 
consist of no needing RNAs (ncRNAs), an abundant class 
of genes which are not translated to proteins and instead 
function directly in the RNA form. In spite of the progress 
in both computational RNA biology and experimental 
techniques, annotating ncRNAs in mammalian genomes 
remains a major challenge. The challenge is to carefully 
pin point only functional ncRNAs, which can be done 
through similarity to existing ncRNAs or de novo through 
a combination of RNA structure prediction and RNA-seq 
data. The latter may with subsequent effort reveal full 
length transcripts, yet with unidentified function [2] and 
functional assignment will require tedious lab work on 
each instance. 

The human genome is the most comprehensively anno- 
tated mammalian genome. Yet, the annotation of ncRNAs 
is still an ongoing effort, which was initiated with the 
release of the human genome more than a decade ago 
based on similarity search [3] and have expanded sev- 
eral times most noteworthy in the recent ENCODE 
and GENCODE projects [4-6]. Within the most recent 
GENCODE annotation ~9300 long ncRNAs (IncRNAs) 
have been identified [6] by manual intervention, though 
their functionality in general is yet to be uncovered. 

More and more mammalian genomes are being 
sequenced (the NCBI website as of May 2013 lists assem- 
blies for 119 mammalian genomes). Thus it becomes 
important to annotate ncRNAs in mammalian genomes 
at least in a version containing the most well established 
RNAs and RNA families. The annotation of new mam- 
malian genomes will in general build upon the existing 
annotation of the human genome; however, the corre- 
sponding number of years of effort is not feasible and a 
more direct strategy is needed. 

The most comprehensive collection of structured RNAs 
is found in Rfam [7,8], which has been constructed in 
a large collaborate effort by the (noncoding) RNA com- 
munity. However, even in Rfam the annotations lack 
explicit functional annotation and the models for struc- 
tured RNAs do not always hold a strong discrimination 
power to pseudogenes, as these may be incorporated into 
the Rfam seed sequences on which the structure models 
are based [9]. It is in general not possible to distinguish 
between real genes and pseudogenes, except for a few 
cases, exemplified by tRNAs [10]. 

Analysis on early and incomplete pig sequencing data 
for structured RNAs have been carried out previously 
both on genomic as well as on EST data [11-13]. Whereas, 
the annotation of EST sequences was based on de novo 



assembly comparison of the resulting contigs to related 
organisms, the annotation of genomic DNA as frag- 
mented due to low coverage and annotation was carried 
out for all of the contiguous pieces in a similar way. Inter- 
estingly the EST annotation provided additional expres- 
sion profiles as well. These studies on the incomplete 
genome included discovery of homologues as well as de 
novo predicted ncRNAs. As it was shown for the full 
transcriptomic analysis on the pig ESTs, the most diverse 
expression is in brain and testes tissues [12] and par- 
ticular brain and developmental tissues hold a relatively 
higher expression of putative ncRNAs [13] . The search for 
IncRNAs in the pig genome has just begun and includes 
expression analysis of an mRNA-like ncRNA [14], sugges- 
tion of an mRNA-like ncRNA [15] and two other cases in 
the IncRNA database [16]. Other studies indicate a larger 
number of putative IncRNAs in pig [17]. Also, recent 
work has indicated a further potential for ncRNAs in Sus 
scrofa, for example, by the use of the genomic sequence 
to map small RNAs in the cumulus-oocyte complex and 
in early embryos in the pig [18]. Likewise for the discov- 
ery of miRNAs in the pig intestine [19] and in pig skeletal 
muscles [20]. While these provide initial information, the 
sequencing of the full genome for the first time opens for 
a systematic analysis and annotation of ncRNAs in pig. By 
annotating the pig genome for ncRNAs a range of related 
production animals e.g., cow are being annotated as well 
using comparative genomics in this study. 

When annotating genomes of model animals such as the 
pig it is also of interest to narrow down organisms or lin- 
eage specific ncRNAs, as one would like to avoid these in 
studying basic genetic mechanisms which can be used as 
a template in for example human diseases. Conversely, the 
lineage specific ncRNAs can potentially be highly relevant 
when studying the underlying genetic mechanisms of not 
only animal health, but also production. 

To address the problem of ncRNA annotation of a 
mammalian genome using a reasonable amount of man- 
power resources, we here focus on structured RNAs, a 
main characteristic of many ncRNAs and regulatory ele- 
ments in UTRs of mRNAs. We introduce a pipeline using 
the complementing resources of sequence or structure 
homology search, and small RNA-seq data analyzed for 
the potential of miRNAs. On top of this we address de 
novo annotation by employing RNAz [21]. The pipeline 
exploits that a range of related organisms exists where the 
genomes match up in regions extending beyond the indi- 
vidual structured RNAs. This enables manual curation of 
the structured RNAs and allows for synteny analysis of the 
annotation produced in this paper. The recent sequenc- 
ing of the pig genome has provided insight into a range 
of medical, production and evolutionary aspects [22]. The 
full genome sequence, furthermore, makes it possible to 
obtain a good overview of the structured RNAs, genes as 
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well as well as regulatory RNA structure, all with relevance 
for the mentioned topics. 

Results 

The modules of the pipeline 

In the following section we present a pipeline for annota- 
tion of structured RNAs. The pipeline is targeted towards 
annotation of complete vertebrate genomes, but could in 
principle be used to annotate other types of genomes or 
even a collection of individual sequences. The pipeline 
also adds additional information to annotation in the form 
of synteny and contextual information such as close-lying 
protein coding genes. 

The pipeline for annotation of structured RNAs 
presented in Figure 1 is based on three main modules, (/) 



annotation, [ii) pairwise and multiple alignments, and {Hi) 
tagging of the annotation. The last module adds additional 
information to the annotation. 

Module (i): the annotation pipeline. Here structured 
RNA loci in a given input sequence(s) are annotated (an 
RNA locus is in this work defined as a set of overlapping 
RNA structure or sequence annotations). The annotation 
is based on a number of methods, classified as either 
sequence based homology search, structure based homol- 
ogy search, RNA class specific methods, or de novo struc- 
tured RNA prediction. The sequence based homology 
search is presently based on BLAST [23] and a number of 
databases of known ncRNA sequences (see Section RNAs 
of the individual annotation tools for details). Structure 
based homology search is performed with Infernal 
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Figure 1 Pipeline. The modules of the RNA pipeline. Module 1 (annotation): The annotation pipeline takes as input any number of sequences and 
runs a number of external annotation tools on it (see text for details). This leads to the initial annotation of the RNA loci in the sequence. A naming 
and resolving tool decides on the final annotation of the locus. The 3,393 ncRNA genes cover 1 1 conflicts of annotation, 34 loci moved to the 
medium confident annotation during the curation step, 165 novel miRNA loci found exclusively by miRDeep, and 3,1 83 ncRNA genes found by 
homology. LncRNA loci and cis-regulatory elements are annotated separately. Module 2 (multiple alignments):The multiple alignment pipeline runs 
on a genomic scale and aligns the genomic sequence of the input genome against any number of other genomes, finally forming multiple 
alignments in MAF blocks. Module 3(post processing): The post-processing part of the pipeline adds context to the RNAs, which in many cases will 
allow for a curation of the structured RNA loci. The numbers in parenthesis are those obtained after the removal of 34 annotations as part of the 
curation procedure: 12 tRNAs, 15 homology based miRNAs, and 7 denovo miRNAs. 
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[24] and covariance models (CM) from Rfam [7,8]. Class 
specific methods are typically based on sequence or struc- 
ture homology search but with some additional knowl- 
edge about the RNA, at present tRNAscan-SE [10], and 
RNAmmer [25] are built into the pipeline. snoStrip [26] 
is another class specific tool used in the present study, but 
is not yet fully integrated into the pipeline. The predic- 
tion of structured de novo ncRNAs requires information 
in addition to the sequence to be reliable, either in the 
form of multiple alignment of (part of) the input sequence 
used with tools like RNAz [21] or CMf inder [27], or in 
the form of RNA-seq data used with tools like miRDeep 
[28]. This submodule of the pipeline thus requires addi- 
tional user supplied input data, which sets this submodule 
aside from the other three classification methods, which 
work on the input sequence alone. Levels of certainty of 
the RNA predictions are provided, as high, medium, low 
corresponding to the different cutoff levels for the indi- 
vidual tools, and the input to the annotator module will 
depend on the input from the tools as well as the confi- 
dence level. An overview of the cutoff levels are given in 
Table 1. The next step is to mark and resolve the conflicts 
that are introduced by the individual tools, or introduced 
by the merging of the results of the tools. Some tools 
are inherently conflict free like RNAmmer, tRNAscan-SE 
and snoStrip. Some tools, however, are supposed to 
be conflict free, like the Rfam models investigated with 
Infernal. Also when employing a tool like BLAST on 
a number of databases it is expected that a number of 
conflicts are introduced. The conflicts introduced by run- 
ning BLAST were partly resolved by choosing the hit 
with the lowest E-value within each database and class 
of RNAs, e.g., miRNAs or snoRNAs. Conflicts between 
Rfam families introduced by running Infernal were 
marked. Conflicts of annotation introduced by differ- 
ent databases and conflicts introduced by different tools 



between databases are only marked if they represent a 
conflict in classification, e.g., a genomic locus is marked 
as a conflict if annotated as both snoRNA and miRNA. 
Different tools may indicate different names for the same 
RNA. We assumed that if two tools both classified a loci as 
belonging to the same class, but that the name used by the 
two tools were different, this did not indicate a real con- 
flict, but only a conflict in naming. Therefore a name of 
the RNA is chosen based on a specific ranking of the tools 
annotating the loci. In general, the class specific meth- 
ods add extra information to the RNAs that they annotate 
{e.g., tRNAscan-SE tests for the codon) or they anno- 
tate full length RNAs, e.g., RNAmmer, where BLAST or 
Infernal may be unable to. The predictions made by 
high confident BLAST runs will normally be more con- 
fident than the ones obtained from Infernal on the 
same model. Therefore, in the naming procedure class 
specific methods are preferred over BLAST, which in turn 
is preferred over Inf ernal/Rfam. For a full overview 
see Additional file 1: Table SI. De novo methods like 
miRDeep and RNAz are only used as a last resort in the 
naming and classification procedure. 

The result of the two steps of module (0 is an uncurated 
annotation which is to be analysed and curated in module 
{ii) and (///). 

Module (ii): Pairwise and multiple alignments. This 
module works on a full genome and aligns the genome 
against any number of related genomes. A user supplied 
phylogenetic tree is needed as well as a decision on the 
parameters to use for the pairwise alignments. Three sets 
of parameters for the pairwise alignments are provided: 1) 
highly related genomes, 2) related genomes, 3) distantly 
related genomes. These parameter choices are based on 
those used for the UCSC 46-way alignment of the human 
genome, where the three sets approximate the respec- 
tive choices made for primates, placental mammals, and 



Table 1 Cutoffs for individual tools at different global confidence levels 



Com p. strategy 


High 


IVIedium 


Low 


BLAST 


95% id 95% leng* 


92.5% id 92.5% length 


90% id 90% length 


Infernal 


BLAST E=1e-3 


BLAST E=0.1 


No BLAST filter 




Infernal E=le-3 


Infernal E=le-3 


Infernal E=le-3 


Infernal miRNA 


Not applied 


BLAST E=le-3 
Infernal E=1e-9 


BLAST E=0.1 
Infernal E=le-6 


tRNAscan-SE 


High 


High 


As is 


RNAmmer 


As is 


As is 


As is 


snoStrip 


As is 


As is 


As is 


miRDeep 


Hand cleaned 


As is 


As is 



The results from the individual tools are merged at 3 different cutoff levels: high, medium and low. This table shows the correspondence between the cutoff levels of 
the merged annotation and those of the tools. For each computational (comp) strategy, we define these levels. Note that Infernal screen have been divided into 
Infernal (without miRNAs) and Infernal miRNA which is on/y for miRNA families. The Infernal results were filtered by the family specific gathering scores as 
well. For As is, we refer to the programs default values for the respective versions (see the Methods section for version numbers) without subsequent cleaning. 
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non-placental mammals as well as other vertebrates. The 
module is a framework built over pairwise alignments 
with LASTZ [29] and subsequent chaining and filtering 
with the UCSC [30] tool chain. 

Module (in): This module provides genomic context 
to the RNA annotation, or additional knowledge about 
the annotations. At present the following submodules 
are implemented: 1) Genie context, requires the Ensembl 
(protein coding) gene annotation. 2) Clustering of RNA 
loci, checks for other RNA loci annotated in the vicinity 
of the one in question. 3) Conservation of the sequence in 
other organisms derived from the pairwise alignments. 4) 
Conservation of structure derived from the multiple align- 
ments using RNAz. Based on this information a curation 
of particular RNA loci or RNA classes is possible, which 
leads to the curated annotation. 

The pipeline for RNA annotation presented above in 
module (0 is a flexible framework for RNA annotation, 
which may be extended by any number of future tools. 

RNAs of the individual annotation tools 

The pipeline presented above was used to annotate the 
pig genome version 10.2. In this section we present and 
discuss the results of the RNA annotations obtained from 
the individual tools ordered by their classifications as 
sequence based homology search tools, structure based 
homology search tools, and class specific search tools. 
Details about the cutoffs used for the tools at high, 
medium and low confident levels are given in Table 1, 
while the medium and low confident results are mostly 
confined to Additional file 1. 

Sequence based homology search using BLAST was per- 
formed against a number of databases (see Methods 
section for details; an overview of the databases is also 
given in Additional file 1: Table S2). The databases have 
not been redundancy reduced and the sequences are used 
without modifications. The only exception is the miRNAs 
from the Rfam seeds, which had special problems and they 
were therefore excluded from the high confident results. 
The reason is that at least some Rfam miRNA families are 
inconsistent with miRBase (an example is mir-28, mir-708 
where high confident BLAST Rfam hits are observed on 
the same locations on opposite strands, however miRBase 
only match one of the two families to a given location and 
the families are only weakly related. In the Rfam version 
11 the two families are allowed some overlap by declaring 
them part of the same clan). 

We found 1,032 high confident (95% id over 95% of 
the query length) RNA loci belonging to a total of 507 
different RNA families see Additional file 1: Table S3 
for the high confident sequence homology search results; 
the medium and low confident results are included in 
the table for completeness. Special care must be taken 
when using miRBase for annotation, since it contains both 



experimentally determined miRNAs and miRNAs deter- 
mined by homology. In a few cases even purely in silico 
discovered miRNAs have entered the database {e.g., [31]). 
This resulted in removal of 15 miRNA loci, which were 
found by homology to sequences without experimental 
evidence in miRBase, and which have no experimental 
support in this study (see Section Small RNA Sequenc- 
ing and novel miRNA predictions). In total we obtained 6 
annotation conflicts, within the sequence similarity search 
results. The conflicts were always found to be between a 
miRNA from miRBase and a snoRNA (scaRNA or HACA 
box), or a miRNA from miRBase and SRP RNA from 
Rfam. For an overview of misannotation of miRNAs from 
small RNA-seq data see Langenberger et al. [32]. 

Structure based homology search was performed with 
Infernal [24] on the covariance models of the RNA 
families from Rfam (version 10.1) [7,8]. Here, we only 
used the 695 families with seed sequences in vertebrates. 
Each Rfam family comes with a family dependent gath- 
ering score cutoff, however this alone proved insufficient 
for high confidence annotation since without additional 
filtering the scan of the Rfam families in the pig genome 
with Infernal yielded 709,026 loci, where 8% of them 
had conflicting Rfam annotation. 

In light of the high number of matches presented by 
the unfiltered Infernal results we decided to impose 
additional filters on the structure based annotation. The 
details are found in the Methods section. In the cur- 
rent section, we touch upon the methodology and present 
the results of imposing these additional filters. For an 
overview see also Table 1. 

The miRNA families come with special problems and 
must therefore be treated separately. These problems 
include families specifying a very broad range of miRNAs, 
which makes these families unsuitable for specific annota- 
tion and the Rfam miRNA families also come with a very 
high number of conflict loci without very strict filtering 
(see Methods section for details). For the non-miRNA 
families we in most cases use the Infernal E-value cut- 
offs corresponding to the family specific gathering score 
cutoff level. However, for families where this Infernal 
E-value cutoff would be higher than le-3 we keep it at 
le-3. For a sequence to be matched with high confidence 
to a particular family, we require that the sequence shows 
overlapping Infernal and BLAST matches to the fam- 
ily. For the high confident annotation we set the BLAST 
E-value cutoff to le-3. For the medium confidence anno- 
tation we lower the BLAST overlap requirement to a 
BLAST E-value cutoff of 0.1 and for the low confidence 
annotation we need no overlap in the BLAST compar- 
ison. Even a loose BLAST overlap requirement has the 
effect that it anchors the hit to the right family, thereby 
greatly reducing the number of conflicts and number of 
false family assignments at these levels. 
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Just as the Rfam precursor miRNA families were 
removed from the high confident sequence based anno- 
tation, we also removed them from the high confident 
structure based annotation (high confident homology 
based annotation of miRNAs is thus only performed for 
mlRNA sequences from mlRBase and only with sequence 
based homology search). For the medium and low con- 
fident annotations of Rfam miRNA families, we impose 
Infernal E-value cutoff's of le-9, and le-6, respectively. 
The miRNA precursor families are further required to 
have an overlapping BLAST hit, in both medium and low 
confidence annotations. 

It is worth noticing some pitfalls of using Infernal 
with the Rfam families. First of all, we root out most 
problems by imposing a BLAST filter, by discarding the 
non-miRNA families, and by only looking at families with 
seeds in vertebrates. This causes the loss of a few fam- 
ilies that otherwise would have been detected at lower 
cutoffs. For example, we find an extra 6 non-miRNA fam- 
ilies at the lowest cutoff level. Two of them, VA and 
ACAT, are likely false positives without synteny to their 
origins in the original organisms. The other four, GAIT, 
CoTC_ribozyme, SNORD126 and ACA59, however, are 
syntenic to their human counterparts and could well be 
true matches that we miss because of the filtering. On 
the other hand, some families will still have false posi- 
tives, pseudogenes or remote homologs, and we are left 
with no clear way to distinguish these by automatic means 
alone. We therefore proceed to curate the annotation in 
the coming sections. Hence, the high confident annota- 
tion represent a conservative automated structured RNA 
annotation, without curation of individual RNAs or RNA 
families. 

We found 2,324 high confident loci with structure based 
homology distributed on 317 families (see Additional 
file 1: Table S4 for a summary of the structure homol- 
ogy based results). In contrast to the sequence similarity 
based results, conflicts are marked between overlapping 
Rfam families, rather than overlapping RNA classes. How- 
ever, with the applied cutoffs we found conflicts only in 
the medium and low confident results, while the high 
confident results were free of conflicts. If we compare 
the 533 high confident BLAST results based on the Rfam 
database (see Additional file 1: Table S5) with the high 
confidence results of the Infernal run in Additional 
file 1: Table S4 we found that an additional 194 Rfam fam- 
ilies and an additional 1,791 loci may be annotated with 
high confidence using structure based homology search. 

Class specific methods were used to further enhance 
and extend the annotations of tRNAs (tRNAscan-SE), 
rRNAs (RNAmmer), and snoRNAs (snoStrip). 

We scanned for tRNAs with tRNAscan-SE [10] in 
addition to the pure homology-search methods presented 
above. In the human genome tRNAscan-SE predicts 



a reasonable number of tRNAs (< 1,000) but for other 
mammalian genomes number of tRNAs predicted by 
tRNAscan- SE is often much higher (See, e.g., cow in the 
tRNAscan-SE database on the tRNAscan-SE home- 
page). Here we adopt the cutoffs used for cow in the 
tRNAscan-SE database to gain a set of high confident 
tRNA candidates (See Methods section for details). The 
unfiltered (or low confident) scan generated 32,303 tRNA 
loci compared to the 810 loci of the filtered results (high 
confident). 21 different tRNA types were found within 
the filtered results; tRNA.SeC were found in addition 
among the loci of the unfiltered results. The division of the 
tRNAscan-SE results on types are found in Additional 
file 1: Table S6. 

The combination of tRNAs found by BLAST, 
Infernal and tRNAscan-SE produced 822 high con- 
fident results. However, as part of the cleanup of the 
annotation we changed the 12 tRNAs not detected by 
tRNAscan-SE from high to medium confident anno- 
tation. Thus reducing the number of tRNAs to 810. It 
is noteworthy that 389 of the 810 (48%) high confident 
tRNAs were found to be conserved in at least 5 other 
organisms (see Section Human and pig synteny analysis 
for details), while less than 2% of the 32,320 low confi- 
dent tRNA candidates are conserved in at least 5 other 
organisms. 

The RNAmmer [25] scan generated 182 rRNA loci sub- 
divided on two 28S fragments, four 18S fragments and 
176 8S fragments. The RNAmmer hits were considered to 
be high confident. The results of the RNAmmer scan were 
not directly comparable to the sequence similarity search 
since RNAmmer detects the larger ribosomal fragments, 
which BLAST have difficulties with, unless the hits are 
subsequently clustered. However, the 4 large ribosomal 
RNAs found by medium confident sequence similarity 
search are all included in the RNAmmer results. 

snoStrip [26] generated a total of 173 snoRNA 
families in 513 loci, these are considered high confi- 
dent results. In comparison, Infernal identified 508 
Rfam based snoRNA loci. However, out of the 638 
combined Infernal and snoStrip loci, 130 are 
unique to Infernal and 137 are unique to snoStrip. 
The annotation created by snoStrip will like the 
Inf ernal/Rfam snoRNAs include pseudogenes and 
possibly false positives as well (see the Methods section 
for false positives on shuffled sequence). 

The annotation obtained in this section represents 
the raw annotation, which in the next section will be 
refined. 

Merged annotations of the homology based RNA loci 
The annotations obtained from the individual tools are 
merged in an automated fashion, which will on the one 
hand ignore simple naming differences between the tools. 
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but will on the other hand highlight real annotational 
conflicts. 

We merged the high confident results of the individual 
tools to obtain the high confident annotation of structured 
RNAs (see Table 1 for the applied cutoffs). We found a 
total of 3,391 homology based high confident structured 
RNA loci and structural RNA elements after moving 12 
tRNAs and 15 miRNAs to the medium confident annota- 
tion (Table 2). Of these 139 are cis-regulatory elements, 
58 IncRNA loci (these are typically sub structures of the 
full length IncRNA), 3,183 (full length) ncRNA genes, 
and 11 conflicts of annotation. The distribution of the 
loci on those found by sequence similarity search, struc- 
ture homology search and class specific tools is shown in 
Figure 2(a). A similar table with the homology based tools 
merged at the medium confident level and low confident 
level, respectively, is found in Additional file 1: Table S7. 

The high number of loci found by sequence similarity 
alone (red area in the figure) are primarily miRNAs found 
by similarity to miRNAs in miRBase, but miRNA families 
were excluded from the high confident structure similar- 
ity search. However, 19 loci found by sequence homology 
to sequences from Rfam were unconfirmed by structure 
homology. These are indicated by the red and purple areas 

Table 2 Results of the homology based pipeline 



RNA class High 





Families 


Loci 


cisreg-elements 


31 


139 


IncRNA-loci 


58 


58 


miRNA 


321 


359 


ribozyme 


3 


8 


rRNA 


5 


185 


snoRNA 


211 


538 


snRNA 


10 


1,030 


tRNA 


51 


810 


Other 


7 


153 


Conflict 


9 


11 


Sum 


706 


3,391 



The combined results of the sequence similarity search, structure homology 
search and class specific tools at the high confident cutoff level (See Table 1}. 
The column RNA class contains cisreg-elements: cis-regulatory elements from 
Rfam/Inf ernal; IncRNA-loci: Infernal IncRNA structure loci; the next 7 rows 
contain (full length) ncRNA genes, miRNA: BLAST from miRBase and miRDeep 
predictions; ribozyme: ribozymes from Rfam/Inf ernal; rRNA: ribosomoal 
RNAs primarily from RNAmmer; snRNA and snoRNA: BLAST results and results 
from Infernal/Rfam;tRNA: tRNAs tRNAs from BLAST; tRNAscan-SE and 
Inf ernal/Rfam; IncRNA-loci: structural loci from larger genes(lncRNAs); other: 
RNA families from Rfam not belonging to one of the other classes; conflict: 
conflicts of annotation. Loci are the number of RNA loci of a given class; Families 
are a subdivision of classes into RNAs with the same name. 1 2 tRNAs and 1 5 
miRNAs were moved to the medium confident annotation as part of the 
curation procedure. See text for details. Note that for the final high confident 
annotation we add 1 65 RNA-seq based miRNA candidates, reaching the total of 
3,556 high confident RNA loci. 



(a) BLAST all databases (b) BLAST restricted to Rfam 
BLAST 




Infernal Class specific Infernal Class specific 



Figure 2 Homology based annotation in overview. The Venn 
diagrams for counting the high confident structured RNA loci found 
with sequence homology search (BLAST), structure homology 
search (Infernal) and class specific tools (tRNAscan-SE, 
snoStrip, and RNAmmer). (a) The diagram includes all 3,41 8 high 
confident structured RNA loci obtained by homology. (b)The 
similarity search is confined to the Rfam seed sequences (excluding 
the miRNA families in Rfam). 1 9 loci found by high confident BLAST 
against the Rfam sequences is missed by high confident structure 
homology search (18 -t 1 in the red and purple areas of the right hand 
side of the figure). The reason is our additional Infernal E-value 
cutoff of 1 e-3 imposed on all families. See text for detailed discussion. 
The numbers in parenthesis are after removal of 1 2 tRNAs and 1 5 
miRNAs loci removed in the cleanup procedure. A total of 3,391 RNA 
loci were found. Of these 1,01 1 loci were found by sequence 
similarity, 2,314 were found by structure similarity, and 1,505 were 
detected by class specific methods. 



in Figure 2(b). 15 of these 19 loci belong to two cis- 
regulatory families (SECIS_1 and IRE), 3 are tRNAs and 
1 is a snoRNA belonging to the U3 family. In particular a 
cluster of 10 IRE loci located on chromosome 13 appear 
to be valid. 8 of these were confirmed by sequence sim- 
ilarity alone, however our imposed Infernal E-value 
cutoff of le-3 filter them out, while 2 were confirmed 
by both sequence and structure homology. The three 
tRNAs all fail the E-value filter of le-3, however, two 
of the tRNAs appear to be valid family members, both 
confirmed by low confident tRNAscan-SE. The third 
sequence might be incorrectly entered into the Rfam seeds 
with an Infernal E-value of 3.04e-02. The U3 locus may 
be an incorrectly curated sequence in Rfam with a score 
only half of the gathering score cutoff. 

The substantial change between the two Venn diagrams 
(Figure 2) in the counts of the loci found by class specific 
tools and/or sequence and structure homology is caused 
by tRNAs found by BLAST against tRNAdb, but not found 
by BLAST against the tRNA seeds from the Rfam fam- 
ily. Amongst the 701 loci found by class specific tools 
alone were 115rRNAs (16%), 135 snoRNAs (19%), and 451 
tRNAs (64%). All but 21 rRNAs found by the class spe- 
cific RNAmmer were recaptured by Infernal in the 5S 
rRNA family in the complete Infernal scan without any 
BLAST filtering and using the family specific gathering 
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score as cutoff. However, the gathering score cutoff imply 
an Infernal E-value cutoff of ~1000 for the 5S rRNA 
family. Similarly, 220 of the 451 tRNA candidates found 
by tRNAscan- SE but missed by high confident structure 
homology search were also found in the tRNA Rfam fam- 
ily in the complete Infernal scan. Here the gathering 
score cutoff implied an Infernal E-value cutoff of ^ 
30. Finally, 37 of the 135 snoRNAs found by snoStrip 
were also found in the complete Infernal scan. For 
the 5S ribosomal RNAs, tRNAs and for the snoRNAs 
detected by snoStrip, the difference in the loci detected 
by Infernal and by the class specific tools, indicates 
that sequence based homology search and additional test 
for class specific methods is a more optimal way of search- 
ing for these classes and genes. As a final remark, 1,127 
of the 1,519 (74%) loci found by Infernal, but missed 
by the class specific tools, were marked as pseudogenes in 
the curation procedure (see the following sections). 

Merging the results of the tools at the high confidence 
level introduced two kinds of inconsistencies. Firstly, 
naming differences due to inconsistent naming of the 
RNAs in different databases and by different tools. This 
type of inconsistencies were resolved by consistently pre- 
ferring one tool over the others within each class of RNAs 
(See Additional file 1: Table SI). Secondly, the remain- 
ing inconsistencies will be genuine conflicts that needs 
to be resolved by hand, if possible. The conflicts within 
the high confident results are all between a miRNA and 
a non-miRNA. That is, some miRNA and one of the 
following snoRNAs: SNORA36, SNORA81, SCARNA4, 
SNORD59, SNORA53, SCARNA15, and SNORA18; Mir- 
1285 and Metazoa _SRP; and finally 28S rRNA and 4 
different miRNAs. 

This concludes the homology part of the annotation. 
Overall the homology part of the pipeline contributed 
3,391 high confident structured RNAs (see Figure 2), after 
removal of 12 tRNAs and 15 miRNA loci in the cleanup 
procedure. The next section will deal with the small RNA- 
seq data, which will not only add confidence to the high 
confident miRNAs annotated above, but will also move 
a few medium or low confident miRNAs annotated by 
homology into the high confident annotation. 

Small RNA sequencing and novel mIRNA predictions 

Ten combined small RNA libraries were sequenced and 
scanned for miRNAs using miRDeep (See Methods 
section for details). The read profiles are available for 
download and online visualisation from http://rth.dk/ 
resources/rnannotator/susscrl02/versionl.02. 

The raw output from miRDeep indicated 467 putative 
miRNA loci with at least 10 reads present. However, due 
to duplication of genomic sequence in assembled chro- 
mosomes as well as unplaced scaffolds only 421 unique 
miRNA precursor sequences and 381 mature sequences 



were found. (Note: The 421 miRNA precursors were 
given provisional names, pre-1 ...pre-421, pending cor- 
rect names upon addition to miRBase). The cleanup pro- 
cedure described in the Methods section reduced the 
number of sequences to 388 spanning 431 loci. 

Of the 388 precursor sequences 174 are already known 
for pig in miRBase version 18, a further 69 could be identi- 
fied by high confident sequence identity to miRNAs from 
other organisms in miRBase version 18. Finally 10 precur- 
sors without high confident annotations were identified 
using a combination of low-confident sequence identity 
(90% id), and synteny to other organisms. 

The RNA annotations from the pipeline were checked 
for overlap with the small RNA library. We required a 
block of at least 10 reads overlapping with at least 20 
nucleotides of the annotation and found overlap with 780 
of the high confident annotations and 819 of the low con- 
fident annotations (308 and 329 of those were miRNAs 
at the respective annotation levels). See Additional file 
1: Table S8 for a complete list of read supported high 
confident annotation. 

The small RNA-seq data may have reads from ncR- 
NAs yet to be discovered, beyond the miRNAs found 
by miRDeep. In the next section we present analysis 
with a recently developed tool, which may aid in find- 
ing new ncRNAs based on reads from small RNA-seq 
data. 

Annotation of de novo RNA transcripts using read profiles 

and deepBlockAlign 

Small RNA sequencing data may contain transcripts, 
beyond miRNAs. To detect these de novo RNA tran- 
scripts, the closely spaced (<30 nt) set of mapped 
reads were grouped into distinct so-called block groups 
using blockbuster [33] and the resulting block group 
structures were compared by their read profiles using 
deepBlockAlign [34] to obtain putative RNA annota- 
tions of the unknown transcripts. See Methods section for 
details. 

From blockbuster we found 1,127 block groups with 
a length of at least 50 nt, of which 541 overlapped with 
homology based annotations. Out of the 586 unannotated 
block groups 40 overlapped with miRDeep predicted 
novel miRNAs. By comparing (aligning) the read pro- 
files of the annotated and unannotated RNA transcripts 
using deepBlockAlign we were able to give putative 
annotations to 165 block groups having no high confi- 
dent homology annotation. The results are summarized in 
Table 3. 

deepBlockAl ign and miRDeep share 36 predictions 
and in most (31) of the cases deepBlockAlign agrees 
with the miRNA assignments of the loci, however, 5 pre- 
dictions made by miRDeep had read profiles that accord- 
ing to deepBlockAlign were most similar to snoRNAs 



Anthon etal. BMC Genomics 2014, 15:459 
http://www.biomedcentral.eom/1 471-21 64/1 5/459 



Page 9 of 27 



Table 3 Annotation of unannotated block groups 




mlRDeep-unannotated 


miRDeep-miRNA 


Sum 


dba-unannotated 


417 


4 


421 


dba-miRNA 


46 


31 


77 


dba-rRNA 


1 


0 


1 


dba-snoRNA 


37 


5 


42 


dba-snRNA 


6 


0 


6 


dba-tRNA 


39 


0 


39 


dba-annotated 


129 


36 


165 


Sum 


546 


40 


586 



The transcripts from the small RNA study without annotation by the homology 
based pipeline were analysed with blockbuster resulting in 586 transcripts 
(block-groups). The table displays a comparison of the denovo annotation of 
these block-groups by deepBlockAlign and with miRDeep. The table shows 
a comparison of the deepBlockAlign and miRDeep annotation of the 586 
unannotated block groups. The second column contains the 546 block groups 
not annotated by miRDeep, the third column the 40 block groups annotated by 
miRDeep and the fourth is the sum of the two previous columns, i.e. all 586 
block groups. The rows contain the deepBlockAlign annotations: the second 
row contains the block groups without deepBlockAlign(dba) annotation, 
rows 3-7 contains the deepBlockAlign classifications, row 8 is the sum of 
rows 3-7, and finally row 9 is the sum of rows 2 and 8, that is all 586 block groups 
and depending on the column, their miRDeep annotation. 



and 4 had no matching profiles at the deepBlockAlign 
score cutoff of 0.6 [35]. These nine profiles are shown 
in Additional £Qe 1: Figure SI and Figure S2. In general, 
the miRDeep predictions with read profiles that align 
poorly with read profiles of known miRNAs, either have 
unusually long or short loop regions or have read pro- 
files different from the two peaks normally observed for 
miRNAs. 

Five interesting examples among the unannotated read 
profiles not overlapping with miRDeep predictions can 
be found by requiring both a deepBlockAlign clas- 
sification as well as an overlap with a structurally 
conserved region as predicted by RNAz. These five exam- 
ples are displayed in Additional file 1: Figure S3-S5. 
And amongst these examples we find 2 miRNAs that are 
missed by both high confident BLAST and miRDeep, 
but deepBlockAlign identifies the read profile as 
miRNA like, and low confident BLAST identifies the 
miRNAs as mir-223 and mir-431. One further miRNA, 
mir-1388, is identified by low confident BLAST and by 
deepBlockAlign, but is not overlapping with an RNAz 
loci (Additional file 1: Figure S6). 

The examples presented above show how deep 
BlockAlign present an experimentalist with extra 
information on top of conventional analysis of small RNA- 
seq data. Both the 10 putative miRNAs with uncertain 
miRNA profiles, and the 3 miRNAs missed by both 
BLAST and miRDeep require additional experimental 
scrutiny, which, however, is outside the scope of this 
work. 



A multiple alignment of pig and 20 other vertebrate 
genomes 

This and the following sections are concerned with adding 
context to the annotation provided above. We begin with 
a synteny analysis of the pig genome and of the structured 
RNA annotation. 

To investigate synteny between pig and 20 other verte- 
brate genomes, we formed LASTZ/UCSC [29,30] pairwise 
alignments, which were subsequently cleaned for non- 
syntenic alignments. The resulting coverage of the pig 
genome ranged from around 1.7% for zebrafish to 64% 
for horse measured on the coverage of the non-Ns of 
the pig genome (the N's currently account for 10% of 
the 2.8 Giga bases in the assembly). A multiple align- 
ment was formed from the pairwise alignments with the 
Threaded-Blockset-Aligner (TBA/MultiZ) [36] 
using the phylogenetic tree presented in Figure 3 as guide. 
This pig-centered multiple alignment featured a coverage 
of 79% of the pig genome (Table 4 and Figure 4). Note that 
the combined pipeline of TBA/Mult iZ and LASTZ is also 
known as multiz. 

The coverage of the pig genome featured in Table 4 and 
Figure 4 is a less than ideal measure of the phylogenetic 
distance between the pig (target) and the other species 
(query). Ideally the distance is measured on the sequence 
change within a set of known and carefully selected genes. 
The coverage is influenced not only by varying rate of 
sequence change depending on the genomic loci but also 
on the different alignment parameters used for different 
genomes (that is, the more sensitive parameters used for 
the more distant genomes will result in a artificially high 
coverage), on the size of the completed genomes, and also 
on the degree of completion of the genomes. 

The multiple alignment obtained in this section will be 
further used to find structurally conserved RNAs using 
RNAz. An application of the pairwise alignments is the 
synteny analysis of the structured RNA annotation in pig. 
However, to fully utilize the alignments we need a good 
annotation of structured RNAs in the query organism. 
The best annotated organism among the vertebrates is 
human and we therefore proceed with a more detailed 
analysis of the synteny between human and pig. 

Human and pig synteny analysis 

We performed a detailed synteny analysis of the relation- 
ship between the pig and human genomes based on the 
pairwise alignment between the two. The analysis lead to 
blocks at least 5,000 nt long in which the synteny and 
strand information of the pairwise alignments was unbro- 
ken. These blocks, referred to as synteny blocks, were 
broken at gaps larger than 5,000nt. We identified 40,781 
such synteny blocks between human and pig, which cov- 
ered 79% of the non-Ns of the assembled pig chromo- 
somes. The syntenic blocks contain gaps not covered by 
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susScrl02(pig) 
turTrul(dolphin) 
bosTau5(cow) 
equCab2(horse) 
felCat4(cat) 
canFann2(dog) 
eriEurl(hedgehog) 
hgl9(human) 
tarSyrl(tarsier) 
mm9(mouse) 
m4(rat) 

oryCun2(rabbit) 
loxAfr3(elephant) 
echTell(tenrec) 
dasNov2(armadillo) 
choHofl(sloth) 
monDom5{opossum) 
ornAnal(platypus) 
galGal3(chicken) 
xenTro2(frog) 
danRer7(zebraflsh) 

Figure 3 Phyologentic tree. Phylogenetic tree for the pig genome multiple alignments. This phylogenetic tree is derived from the human 
phylogenetic tree from UCSC based on the 45-way alignment. The organisms has been reordered to put pig on top and the branch lengths have 
been ignored. A tree in this form is needed as parameter for the TBA/MultiZ program. The tree is without branch lengths since these have not 
been recalculated for the pig genome. 




alignments, and the coverage is therefore larger than the 
raw coverage of 56% of the pig genome by the human 
genome as reported in Table 4. 3,242(97%) of the 3,556 
high confident annotations were found on the assembled 
chromosomes and of these 78% are in the human syntenic 
blocks (See Table 5). 

When we merged neighbouring synteny blocks with 
identical chromosome and strand information, we found 
that the number of blocks could only be reduced to 13,302. 
If we instead ignored the strand information and merged 
neighbouring synteny blocks based only on a require- 
ment of identical chromosome information, the number 
of blocks could be reduced to 218 synteny regions between 
human and pig, which cover 99% of the pig assembly (note 
that we allow large gaps, inversions and even to some 
extend rearrangements within these regions). The num- 
ber of synteny regions found here, may be compared to 
the 51 synteny groups and 173 conserved segments found 
in the experimentally based human pig comparative map 
in reference [37]. 

When we did a similar analysis for the horse/human 
pairwise alignment we found the initial 27,219 initial syn- 
teny blocks could be reduced to 1,786 synteny regions 
when we ignored the gaps but respected the strand, and 
finally to 203 synteny blocks when gaps and strand were 
both ignored. These cover 99% of the horse genome. The 



difference lies in the assembly of the two genomes. The 
horse contigs are ordered and oriented correctly using 
a combination of probe sequences and positional infor- 
mation from the comprehensive radiation hybrid and 
FISH maps [38]. Notice that the 218 regions represent 
a pig-centered view of the pig-human synteny regions. 
The corresponding regions in human may contain (large) 
gaps and in certain cases overlap. A cleaned up ver- 
sion consisting of 377 smaller regions is found on the 
webpage (http://rth.dk/resources/rnannotator/susscrl02/ 
versionl.02/regions.html). In this version large gaps in 
the human regions are ignored, but in some cases small 
regions will overlap with large regions when viewed in the 
human coordinates. 

The analysis of the synteny between human and pig 
highlights some issues in the current pig assembly, which 
will be touched further upon in the following sections. 
These issues however, only makes a detailed synteny anal- 
ysis more desirable, since it enables the identification of 
some of these issues. 

Annotations having duplicated sequence 

A particular issue in the current assembly is dupli- 
cated sequence, both within the assembled chromosomes, 
and between sequence on the assembled and unplaced 
scaffolds. 



Anthon etal. BMC Genomics 2014, 15:459 
http://www.biomedcentral.eom/1 471-21 64/1 5/459 



Page 11 of 27 



Table 4 Pairwise alignments 



Genome 


Species 


LASTZ/chaining options 


Coverage 


Alignment type 


danRer7 


Zebrafish 


Distant 


1.69 


rbest 


xenTro2 


Frog 


Distant 


1.94 


rbest 


galGal3 


Chicken 


Distant 


3.61 


rbest 


omAnal 


Platypus 


Distant 


5.97 


rbest 


monDomS 


Opossum 


Distant 


10.70 


syntenic 


eriEurl 


Hedgehog 


Close 


19.08 


rbest 


echTell 


Tenrec 


Close 


20.43 


rbest 


rn4 


Rat 


Close 


25.76 


syntenic 


mm9 


Mouse 


Close 


27.71 


syntenic 


dasNov2 


Armadillo 


Close 


30.10 


rbest 


choHofI 


Sloth 


Close 


30.94 


rbest 


tarSyrl 


Tarsier 


Close 


36.80 


rbest 


oryCun2 


Rabbit 


Close 


42.11 


syntenic 


feICat4 


Cat 


Close 


43.42 


rbest 


loxAfrS 


Elephant 


Close 


47.52 


syntenic 


turTrul 


Dolphin 


Close 


53.93 


rbest 


hg19 


Human 


Close 


55.93 


syntenic 


bosTauS 


Cow 


Close 


58.13 


syntenic 


canFam2 


Dog 


Close 


58.83 


syntenic 


equCab2 


Horse 


Close 


63.90 


syntenic 


21 way 


Multiple alignment 






78.98 



Pairwise and multiple alignments of the pig genome. The other genomes were obtained from the UCSC genome browser website in their lower-case masked form. 
IVlasking was performed by the UCSC with RepeatMasker and Tandem repeat masker. The UCSC genome designation is given in the first column. The options 
for the pairwise alignments are given in the third column as either closely or distantly related to the pig in accordance with the choices made by UCSC for the human 
genome. The distance to pig has implications for the LASTZ and axt Chain options as listed in Additional file 1 : Table SI 9. The coverage of the pig genome is given 
in % in the fourth column based on the number of non-Ns covered by the pairwise alignment after cleaning of the alignments as specified. The pairwise alignments 
are cleaned either by synteny {small alignment chains are deleted when they would otherwise break synteny) or by deleting all but the best alignments where the 
target genome is multiply-covered. Both methods reduces the coverage of the pig genome by the alignment. The choice of cleanup method is given in the last 
column. A graphical representation of the coverage is given in Figure 4. 



We found that annotations expected only once in a 
mammalian genome were found in more than one place 
in the pig genome with the exact same sequence. Some- 
times these duplicated annotations were found twice on 
the same chromosome in close proximity (for example 
mir-196b twice on the same strand, and both mir-615 
and mir-194 twice, once on each strand) and sometimes 
they were found both on the assembled chromosomes and 
within the unplaced contigs (for example mir-127, mir- 
155). Furthermore, sequence from the unplaced scaffolds 
are in some cases largely identical to sequence placed on 
the assembled chromosomes. To investigate this problem 
we formed a pairwise alignment of the pig genome with 
itself using Blat [30] requiring high sequence identity 
(>= 98%) and allowing no introns in the alignments. The 
result of the alignment of the assembled chromosomes 
against unplaced scaffolds were that 63% of the sequence 
of the unplaced scaffolds were aligned to sequence on 
the chromosomes. The self alignment also allowed the 
identification of a number of annotations that were dupli- 
cated in the pig genome. Some of these are known to 



have many copies in mammalian genomes, like 7SK, 
8S rRNA, GP_knotl, Histone-3-prime-UTR, SNORA70, 
SNORD116, tRNAs, Ul, and U6. However, others like 
the miRNAs listed above were not expected to be found 
in multiple copies with high sequence identity. A mod- 
ule was therefore added to the pipeline which identifies 
annotations with identical sequences. We found that 10% 
high confident miRNA genes had exact duplicates in the 
assembled chromosome or unplaced scaffolds. A full list 
of duplicated annotations is found in Additional file 1: 
Table S9. Note that, all copies of duplicated genes are used 
in the remaining analysis as it raises problems with the 
statistics when the genomic context is different between 
two copies. For example one copy may be conserved in 
another organism, while the other is not. 

General synteny analysis of the pig annotation 

To further enhance the annotation we performed a gen- 
eral analysis of the conservation of the structured RNA 
loci. The full benefit of this analysis is only obtained when 
we have a good annotation of structured RNAs in the 
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Figure 4 Coverage of the pig genome. Absolute coverage of the pig genome by the genomes used for the multiple alignment. The coverage is 
based on the cleaned alignments with single best coverage of the pig genome. However, depending on the method of cleaning the coverage of 
the target genomes (x-axis) may be multiple in some locations. Far left, in green, is the result of the coverage of the pig genome by any genome in 
the multiple alignments. 



query organism. Special emphasis is therefore put on the 
pig/human conservation of the annotation. 

We analyzed the synteny of the structured RNA anno- 
tation using the pairwise alignments is to check the cover- 
age of a given annotation by transferring the coordinates 
to the target species using the pairwise alignment 
(liftOver). However, since the alignments are single 



best coverage of pig, but may be multiple coverage of the 
target genome in some region, the annotation will in some 
cases be matched with a paralogous sequence in the tar- 
get genome. This problem is to some degree mitigated 
by using the cleaned-up pairwise alignments described 
above, which should reduce the coverage of paralogous 
alignments. 



Table 5 Synteny of the RNAs of the homology based pipeline 

hgl 9 RNAs conserved in N other organisms 
RNA class #loci = 







Syntenic 


Conserved 


Both 


1 


5 


15 


cisreg-elements 


139 


80 


84 


65 


116 


86 


31 


IncRNA-loci 


58 


57 


53 


53 


58 


57 


7 


mlRNA 


369 


303 


349 


292 


360 


349 


102 


putative-miRNA 


155 


121 


25 


20 


65 


25 


1 


ribozyme 


8 


8 


3 


3 


3 


3 


0 


rRNA 


185 


143 


0 


0 


6 


1 


0 


snoRNA 


638 


473 


266 


221 


400 


282 


43 


snRNA 


1,030 


674 


24 


13 


119 


26 


4 


tRNA 


810 


549 


274 


199 


389 


284 


14 


other 


153 


111 


10 


10 


17 


11 


0 


Conflict 


1 1 


10 


10 


10 


10 


10 


4 


Sum 


3,556 


2,529 


1,098 


886 


1,543 


1,134 


206 



The columns are, RNA class, # RNA loci. # loci in human syntenic blocks, # loci conserved in human by 80% sequence identity. # loci both syntenic and conserved that is the 
number of ncRNAs in syntenic blocks where the ncRNA is actually conserved in human. ^ loci RNAs conserved in N other organisms, grouped by number of loci 
conserved in at least 1, 5, or 1 5 other organisms. Conservation is determined by the sequence identity in the pairwise alignments. The RNA loci are located in the 
pairwise alignments and the sequence identity is calculated when at least 80% of an RNA locus is covered. The RNA locus is counted as conserved in that organism if 
the locus has a sequence identity of at least 80%. In the table the number of RNAs conserved in at least N (N=1; N=5 or N=15) of the other genomes: bosTauS, 
canFam2, choHofI, danRer7, dasNov2, echTell, equCab2, eriEur1,felCat4, galGalS, hgl 9, loxAfr3, mm9, monDomS, ornAnal, rn4, oryCun2, tarSyrI, turTrul, xenTro2. 
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To annotate the RNAs in the non-pig organisms, each 
annotation is located in the pairwise alignment and the 
per-base identity is calculated. In Table 5 we list the loci 
that are conserved in human (left side of the table) and 
in depth in the phylogenetic tree (right side of the table). 
We have chosen 80% sequence identity as a cutoff for con- 
servation between pig and other organisms. This choice is 
based on our experiences with the conservation of miR- 
NAs between pig and human (see Section miRNAs in the 
pig genome for details). In general, deciding on such a cut- 
off is difficult if not impossible, and should depend on the 
class of the RNA annotation as well as the phylogenetic 
distance between the two organisms. 

Genie context of the annotations 

Another important characteristic of the ncRNA annota- 
tions is their contexts of protein coding genes. We there- 
fore marked the annotations with genie context as either 
UTR, coding-exonic or intronic using the coding genes 
from Ensembl version 68. Annotations overlapping a cod- 
ing gene were marked as having a) coding-exon context if 
at least 50% of the annotation overlapped with a coding 
exon or b) 5' or 3' UTR context if the annotation over- 
lapped with a non-coding exon. If an exon was marked 
as both coding and non-coding for different transcripts, 
the coding-exonic context was preferred. When an anno- 
tation is close to more than one protein coding gene all 
genie context designations are kept. We also look for pro- 
teins coding genes 10,000 nucleotides up or down stream 
of the annotation (similar cutoff as the miRNA genie con- 
texts used by miRBase). A summary of the genie contexts 
for the different RNA classes are shown in Additional file 
1: Table SIO. 

Note that the protein annotation of the pig genome is 
at an early stage and is further made difficult by incor- 
rect ordering and strand assignment of the contigs of the 
pig genome hinted at in the section about the human/pig 
pairwise alignment. Also the UTRs of are not always 
well-defined or even known. Therefore the assignment of 
context as coding-exonic, UTR, or intronic will in some 
cases be immature or even incorrect. 

We will touch upon the genie context again when ana- 
lyzing conservation of contexts for particular classes of 
structured RNAs, e.g. cis-regulatory elements and snoR- 
NAs. 

Clustering of the annotations 

Certain structured RNAs are known to form clusters 
based on genomic position. Examples include, miRNA 
clusters, snoRNAs spliced from itrons, clusters of tRNAs 
and clusters of cis-regulatory elements, e.g. iron response 
elements (IRE). 

All pairs of high confident annotations less than 10,000 
nt apart on the same strand, were used to form clusters 



by single linkage. The distribution of the cluster size is 
shown in Additional file 1: Figure S7. The cutoff of 10,000 
was chosen in accordance with the cutoff used by miR- 
Base for the clustering of miRNAs. We found 261 clusters 
of which 10 contained at least three miRNAs. A number of 
large snoRNA clusters were also observed. Detailed exam- 
ples of the clusters are given in later sections for miRNAs, 
and snoRNAs. In Additional file 1: Table Sll the num- 
ber of RNAs in different classes and different clustering 
distances is shown. 

De novo prediction of structurally conserved RNAs 

When novel structural RNAs in e.g., UTRs are predicted 
by computational methods or by experiments, structural 
conservation may help the experimentalist in asserting 
their validity and phylogenetic domain. We therefore pre- 
dicted de novo structured RNAs in the pig genome using 
RNAz and the MAE blocks of the multiple alignment of 
pig against 20 other vertebrate genomes. Details are in the 
methods section, including a discussion of false positive 
rates of RNAz. 

We found 95,106 strand specific structurally conserved 
loci with a p-score cutoff of 0.9, which could be merged 
to 83,869 conserved loci when the strand prediction of 
RNAz was ignored. Additional file 1: Table S12 shows the 
overall statistics of the RNAz hits. The RNAz predictions 
were assigned genie contexts according to the same cri- 
teria as for the homology based annotations. Most of the 
RNAz predictions were found in intergenic regions (59%) 
or were intronic (23%); however a small number are found 
to be overlapping with exons of protein coding genes (1%) 
or near the beginning or the end of protein coding tran- 
scripts (5%) and (6%) respectively, 7% lie in the context of 
multiple genes, and their gene context is undecided. 

The annotations of the pipeline were marked as struc- 
turally conserved, if there was an overlap of at least 20 
nt to an RNAz loci irrespective of strand. The results are 
show in Table 6. 538 of the high confident annotations 
were found to overlap with RNAz loci and as expected the 
highly structured miRNA has the highest recovery rate in 
the RNAz results. It is, however, somewhat lower than one 
might have expected when Table 6 is compared to Table 5, 
which shows the statistics of RNAs conserved in genomic 
location. Eor example, based on the pairwise alignments 
90% of the high confident miRNA candidates obtained by 
sequence homology are conserved in genomic position in 
at least 6 organisms including pig, which is included in the 
pig-centered alignments by default. However, RNAz only 
predicts 62% to be structurally conserved. This is not nec- 
essarily a flaw in RNAz, but more likely in the multiple 
alignment MAP blocks. For example, mir-30e is found to 
be conserved in 17 of the 21 species but is not confirmed 
with RNAz. Examination of the sequences shows that 
mir-30e is partially sequenced in the tenrec genome. 
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Table 6 Overlap of the RNAz predicted de novo with the 
high confident annotation 

Annotation RNAz overlap 

RNA class — 





Families 


Loci 


Families 


Loci 


cisreg-elements 


31 


139 


4 


10 


IncRNA-loci 


58 


58 


3 


3 


miRNA 


330 


369 


222 


241 


putative-miRNA 


135 


155 


14 


16 


ribozyme 


3 


8 


0 


0 


rRNA 


5 


185 


2 


2 


snoRNA 


211 


638 


57 


72 


snRNA 


10 


1,030 


9 


20 


tRNA 


51 


810 


35 


154 


other 


7 


153 


3 


4 


conflict 


9 


11 


4 


6 


sum 


850 


3,556 


354 


528 



Comparison of the strand specific RNAz results with the result of the automatic 
annotation pipeline. The columns are, RNA class, # RAM families In the high 
confident annotation, # RNA loci in the high confident annotation, # RAM families 
that overlap with the rnaz predictions, # RAM loci that overlap with the RNAz 
predictions. 



which breaks the multiple alignment MAF block and 

thereby prohibits the RNAz prediction. Repairing the 
MAF blocks on a genome wide scale would in principle be 
possible. It is however, not straightforward and is outside 
of the scope of the present study. 

As a final remark, it is noteworthy that while around 
65% of the miRNAs obtained by similarity search are over- 
lapping with RNAz loci, the same is only true for around 
10% of the novel miRNAs. 16% are conserved in 6 organ- 
isms and 10 or 6% appear to be conserved in human 
and may thus represent miRNAs in human yet to be 
discovered. 

Lineage specific structured RNA loci 

The lineage specific structured RNAs are important for 
the pig, both for the understanding of pig specific traits 
and when considering the pig as model organism as these 
can help pointing towards potential concerns in some 
pathway modelling.. We therefore investigated the lineage 
specificity of structured RNAs inside and outside of the 
Laurasiatheria (which contains pig, cow, dolphin, horse, 
cat, dog, hedgehog amongst the genomes investigated in 
the pair wise alignments). 

The bulk part of the structured RNAs obtained here 
were found by homology search, and will therefore not be 
pig specific. However, the de novo RNAs may be lineage 
specific and both homology and de novo based annota- 
tions may be found in lineage specific loci, that is loci 
preserved only in a branch of the phylogenetic tree. Our 
search will therefore initially focus upon lineage specific 
regions in the pig genome. 



In general, determining lineage specific RNAs is non- 
trivial since it require clear annotation in one or several 
closely related organisms, however, at some cutoff the 
annotation is not to be found in organisms outside the 
organism or the closely related organisms. Furthermore, 
determining the lineage specific annotations might be 
impacted by the lack of genomic coverage and the qual- 
ity of the genomic sequences, as it is the case between 
human and Neanderthal [39]. The search for lineage spe- 
cific structured RNA is further complicated by the fact 
that many structured RNAs can hold the same struc- 
ture while being highly divergent in sequence. Structural 
(re-)alignment of genomic sequence is in these cases 
required to elucidate the structural conservation e.g., 
[40-44]. Hence, we employ a conservative strategy where 
the sequences within the Laurasiatherian branch should 
exhibit relatively high sequence identity (>60%) between 
all pairs of organisms considered while at the same time 
have relatively low sequence identity (<30%) to all organ- 
isms outside the branch. 

To find lineage specific structured RNAs, we calculated 
the sequence identity of the obtained structured RNAs 
between pig and the organism for which we have pair- 
wise alignments, preferably, using the synteny cleaned 
pairwise alignments where available. The RNAz results 
are particularly well suited for this analysis, since struc- 
tural conservation in at least 3 organisms including pig 
is required to yield a result in the first place. In concor- 
dance with our strategy, we need to find a lower cutoff in 
sequence identity for the loci to be not considered with 
the organisms in question. 

In Additional file 1: Figure S8 we plotted the number 
of RNAz loci grouped by a given identity cutoff outside 
the Laurasiatherian branch (these are the lineage specific 
candidates). Up until 30% sequence identity cutoff the 
number of grouped loci remains fairly constant (and rel- 
atively small), but from around 50% quickly increase. We 
therefore choose a conservative sequence identity cutoff 
of 30% outside the lineage, which corresponds to around 
1.5% of the RNAz loci. 

In Additional file 1: Figure S9 we plot the number 
of RNAz loci that have at most 30% sequence identity 
to pig outside the lineage versus the minimum pairwise 
sequence identity between pig and the other species found 
in the pairwise alignment. The number quickly drops 
above 60% sequence identity and we therefore decide on 
a cutoff of 60% and reach a final number of 1,004 lineage 
specific loci containing RNAz predictions. 

The high confident annotation was investigated for lin- 
eage specificity by the same criteria, however, only the 
annotations with at most 5 occurrences in the genome 
were investigated, as highly repetitive sequences are often 
skipped in the alignments. Furthermore, annotations 
with exact duplicates in the genome and annotations 
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found in the unplaced scaffolds were excluded as well. 
With these criteria, we found 23 lineage or pig spe- 
cific structured RNA loci on the assembled chromosomes 
within the high confident annotation shown in Addi- 
tional file 1: Table S13. These are miRNAs mir-1949 
and mir-2320, four de novo miRNAs, 16 snoRNAs and 
one vault RNA. Only four of these are conserved in 
other species within the lineage, three HACA box snoR- 
NAs (SNORA22, SNORA23, SNORA81) and one known 
miRNA (mir-2320), specific for the cow/pig/dolphin part 
of the tree. The vault RNA found on chromosome X 
is likely a pseudogene since it lacks a polll promoter 
sequence. It does, however, score very high according to 
Infernal (69 compared to 82 for the real gene on chro- 
mosome 2). For further discussion about the vault RNAs 
see the Section Curation based on the detection of PolII 
and PolIII promoter sequences. One miRNA, mir-1949, is 
previously known only for rodents, and could also be a 
mis-annotation induced by the otherwise strict filtering of 
the BLAST results. 

Lineage specific ncRNAs is still a relatively unexplored 
area, which is expected to grow as more analysis is per- 
formed on new genomes and organisms. MiRNAs are one 
class of ncRNAs where lineage specific ncRNAs have been 
discovered in pig, but more classes are expected to fol- 
low when the tools for novel ncRNA detection in small 
RNA-seq data are developed further. 

miRNAs in the pig genome 

In this and the following sections we discuss the curation 
of the structured RNA annotation utilizing the different 
types of contexts added to the annotation in the preced- 
ing sections. We begin with the miRNAs in pig and the 
synteny between pig and human miRNAs, which in con- 
clusion will allow us to curate the 125 miRNA genes that 
were found to be part of miRNA clusters common to 
human and pig. 

The high confident homology pipeline provided 359 
miRNA loci and the miRDeep pipeline adds an addi- 
tional 165 de novo miRNA loci for a total of 524 miRNA 
loci. 10 of the miRDeep loci where renamed and reclas- 
sified by homology to human or by low confident (90% 
id) sequence identity to sequences from miRBase. See also 
Additional file 2. 

MiRNA clusters are known for a number of organ- 
isms. In the analysis of the pig genome we observed 58 
miRNA clusters, i.e. multiple miRNA loci within 10,000 
nucleotides of each other, all on the same strand. In cer- 
tain cases, miRNAs in clusters are found like pearls on 
a string. In the pig genome we found 10 miRNA clus- 
ters with at least three miRNAs. 6 had a cluster size of 
up to around 300 nucleotides per miRNA, while the last 
4 had sizes of approximately 1,000-2,000 nucleotides per 
miRNA. 



In comparison to the 58 pig miRNA clusters, a total of 89 
of strand specific miRNA clusters are known in the human 
genome according to miRBase version 19. In the follow- 
ing, we have analyzed how many of these miRNA clusters 
were conserved in human using a combination of BLAST, 
and the genome scale pairwise alignments between the 
two organisms. 

MiRNAs are often well conserved between mammalian 
genomes, however, a simple BLAST of the miRNAs will 
either include too many false positives or occlude the 
more distant homologs. In Additional file 1: Figure SIO we 
have plotted the conservation in pig of the 1,595 human 
miRNA loci from miRBase 19 using the genome scale 
human-pig pairwise alignment. From the figure we con- 
clude that there is no clear indication of a identity cutoff 
where a miRNA may be said to be conserved between 
human and pig. However, if we restrict the plot to the 
miRNAs from human miRNA clusters (Additional file 1: 
Figure Sll) a steep drop-off in the number of conserved 
miRNAs is observed at around 80-90% identity. Through- 
out the paper, we will use a cutoff of 80% sequence 
identity in the pairwise alignments to accept conserva- 
tion between pig and human. This results in a total of 168 
miRNAs from the human miRNA clusters that are con- 
sidered conserved in pig. These form 51 clusters in pig, 
which again contains 151 miRNAs, i.e., 17 miRNAs are 
conserved, but no longer a part of a miRNA cluster in pig 
at this cutoff level. 

The 13 miRNA clusters from human with at least 4 
miRNAs are shown in Additional file 1: Table S14. In the 
following, we discuss the conservation of these clusters in 
pig- 

The mir-512 cluster is known to be primate specific 
according to miRBase, and we see no evidence to the con- 
trary. The mir-379 cluster is incomplete in pig (Additional 
file 1: Figure S12), where it is found in an unplaced scaffold 
in pig missing part of the sequence. A similar issue is seen 
with the mir-450 cluster. The mir-493 cluster appear to be 
broken in pig. Most of the miRNAs are found in a clus- 
ter on chromosome 7, however they are located on both 
strands (Additional file 1: Figure S13). The cow, dolphin 
and horse genomes, which are phylogentically close to the 
pig genome, all harbors near complete mir-493 clusters, so 
this is likely an assembly issue in pig. 

A more difficult case is the mir-532 cluster located on 
chrX in human (Additional file 1: Figure S14), which con- 
tains 8 known miRNAs from miRBase: mir-532, mir-188, 
mir-500a, mir-362, mir-501, mir-500b, mir-660, mir-502. 
In pig we found a cluster on chrX of mir-532, mir- 
188, mir-500, mir-362, mir-500, mir-660, and a mature 
unknown miRNA picked up by miRDeep. Investigation of 
the human-pig alignment revealed that mir-501 is poorly 
conserved with a pairwise identity of 57%, while the 
unknown miRNA picked up by miRDeep is mir-502 with 
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a pairwise identity to human of 87%. Mir-501 is also poorly 
conserved in cow (65%), but somewhat better in horse 
(78%). 

The mir-17 and mir-363 clusters are well conserved 
and supported by reads (Additional file 1: Figure S15 and 
SI 6). In the mir-367 cluster only 3 of the 5 miRNAs are 
identified by high confident BLAST, however, the cluster 
appears complete in the pairwise alignment (Additional 
file 1: Figure S17). 

The mir-513a and mir-514b clusters are incomplete in 
the pig genome (See Additional file 1: Figure S18), lead- 
ing to several problems in the pairwise alignments. Three 
miRNAs are observed in the miRDeep analysis of these 
clusters, which may be assigned as mir-506, mir-508 and 
likely mir-509. 

The mir-450b cluster is incomplete in the assembly, but 
the miRNAs that we find are supported by BLAST and 
reads (Additional file 1: Figure S19). mir-3601 found on 
the opposite strand of the cluster is obtained by sequence 
similarity from cow. 

The cluster containing mir-892c appear at first glance to 
be missing in pig. However, the six miRNAs in the mir- 
892 cluster on chrX in human, are conserved in a cluster 
on chrX in pig (See Additional file 1: Figure S20). The pair- 
wise identities between the human and pig miRNAs in this 
cluster are in the 55-70% range and are thus under the 
cutoff chosen above. The mir-891a/b are two different loci 
in human but only one of them is observed in the pairwise 
alignments in the pig. The cluster is supported in the pig 
by 6 de novo structured RNA predictions by RNAz. 

In conclusion, we include the 125 miRNA loci in the 
curation annotation, for which we have high confident 
annotation and which are are part of miRNA clusters in 
both pig and human. 

snoRNAs in the pig genome 

The conservation and curation of snoRNAs in pig are dis- 
cussed in this section. Using synteny between pig and 
human we curated 268 or 42% of the snoRNAs loci in pig. 

We annotated 645 snoRNA loci (including 7 conflicting 
loci) in the pig genome using a combination of BLAST, 
snoStrip and Infernal on snoRNA families from 
Rfam. 

Like miRNAs, snoRNAs are often found in clus- 
ters and we identified 73 clusters containing multiple 
snoRNAs within a distance 10,000 nt. In two particu- 
lar cases large snoRNA clusters identify snoRNA host 
genes in pig, which are at present missing from the 
Ensembl annotation. In human these two transcripts, 
called SNHGl and GAS5, are known to harbour a 
number of snoRNA. The human version of SNHGl 
harbours SNORD25, SNORD26, SNORD27, SNORD28, 
SNORD22, SNORD29, SNORD30, SNORD31, and the 
same genes in the same order are observed in pig. GAS5, 



similarly displays exactly the same genes in human and 
pig. The 23 snoRNAs contained in these two host genes 
were curated based on this information. 

The snoRNA database contains a set of curated 
human snoRNAs. Using the syntenic pairwise alignments 
between human and pig, we manually inspected the auto 
generated snoRNA annotations. We obtained the human 
coordinates for the pig snoRNAs and calculated the cov- 
erage using the methods explained in the synteny section 
above. We filtered the results by 60% sequence iden- 
tity between human and pig and considered the coordi- 
nates of the curated human snoRNAs from the snoRNA 
database. In total we curated 268 loci, corresponding to 
189 snoRNA families. The 278 non-curated members of 
these families were marked as pseudogenes. While the 
cutoff for sequence identity is 60%, all curated mem- 
bers were found to have a sequence identity above 70% 
between pig and human. It is noteworthy that only 64 
(23%) of the curated snoRNAs are found with high confi- 
dent BLAST. 

This kind of synteny analysis limit ourselves to known 

genes in other organisms, here human. As an example we 
find 8 copies of SNORD14 in pig. A closer inspection, 
however, reveals that 3 of them have exact copies close to 
each other. These copies are likely to be the result of incor- 
rect assembly. We thus find two SNORD14 clusters in 
pig, one with two members and one with three members, 
all scoring about equally well with Infernal. In human 
we find only two curated copies in the snoRNA database, 
which we are able to match and curate in pig using syn- 
teny. However, a full Infernal search reveals a total of 5 
copies in human all scoring about equally well. And these 
form two clusters with 2 and 3 members, respectively, 
each cluster syntenic to those in pig. 

In summary, we curated 42% of the snoRNAs in pig. 
However, it is expected that more snoRNA genes may be 
curated as the snoRNA discovery methods improve. 

Ribosomal RNA clusters in the pig genome 
The ribosomal RNAs (rRNAs) are known to have a num- 
ber of pseudogenes in the mammalian genomes, often 
only as part of the full length rRNA genes. For the rRNAs 
we only found one full length copy of the rRNA clus- 
ter. On chromosome 6 we found 28S, 18S, and 5.8S 
clustered closely together, and the finding is further con- 
firmed by the existence of a pRNA locus upstream of the 
28S locus on chromosome 6. Ribosomal RNAs part of 
this cluster are curated, all other rRNAs are marked as 
pseudogenes. 

Cis-regulatory elements in the pig genome 

The cis-regulatory elements are regions of the RNA (and 
DNA) that regulate genes on the same chromosome, often 
on the same transcript or located relatively close up- or 
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down-stream. Here, we only annotate those cis-regulatory 
elements that have RNA structure as extracted from Rfam. 
The genie context of the annotated cis-regulatory ele- 
ments allowed us to curate 66% of them. For the remaining 
34% the genie context could not be ascertained and the cis 
regulatory elements were marked as pseudogenes. 

Normally close proximity between the cis-regulatory 
element and the gene is observed. Accordingly, the genie 
context of the cis-regulatory elements are of particular 
interest. As expected, we found that most of the anno- 
tated cis-regulatory elements are close to or overlapping 
with protein coding genes. Only 31 of the total 139 cis- 
regulatory elements are more than 10,000 nucleotides 
away from a protein coding gene, while 12 are between 
1,000 and 10,000 nucleotides away. Of these 12 cis- 
regulatory elements 9 belong to the ubiquitous GP_Knotl 
family. 1,000 nucleotides is therefore chosen as the cutoff 
for the curation procedure of the cis-regulatory elements. 
Of the 139 annotated cis-regulatory elements, 20 cis- 
regulatory elements are found to be antisense to protein 
coding genes. However closer inspection reveal all but 
two to be sense to a protein coding gene as well. Many 
of the intergenic cis-regulatory elements are found to be 
from highly repetitive families like GP_Knotl and IRE, 
while others are very confident and highly specific like e.g., 
SECIS-1 and IRES-c-sis. Only detailed analysis can reveal 
if these are true positives where the corresponding protein 
coding gene is undetected by for what ever reason. 

In all 93 (63%) cis-regulatory-element loci were curated 
based on having sense genie context within 1,000 
nucleotides. 31 cis-regulatory elements more than 10,00 
nucleotides away from a protein were marked as pseudo- 
genes. 

As a final remark, the annotation of protein coding 
genes in the pig genome is at an early stage and is further 
complicated by the incorrect ordering and strand of the 
pig genome contigs. Therefore, we are unable to curate 
the cis-regulatory elements according to their exact genie 
contexts (coding-exonic, UTR or intronic), and just note 
in passing that certain elements are expected to have cer- 
tain contexts, e.g., IREs are supposed to be part of the 
UTRs, and we were not always able to confirm this. 

Curation based on the detection of Poll! and Pollll 
promoter sequences 

In the preceding sections we presented the curation of 
specific classes of structured RNAs. We now present 
a more general class of ncRNAs, namely those where 
active genes are expected to have PolII or PolIII promoter 
sequences. This class includes ncRNAs such as snRNAs, 
vault RNA, RNaseMRP and Y RNAs. 

A number of ncRNAs may be curated based on the pres- 
ence of PolII/PolIII promoter sequences. For this we used 
Position Weight Matrices (PWM) extracted from human 



sequences to detect PolII- and PolIII-specific sequence 
elements derived from known human PolII and PolIII 
transcripts (see Additional file 1: Table S21-S26 and 
Methods section for details). Based on the PolIII-PWM, 
a total of 19 out of the 800 U6 candidates returned by 
the Inf ernal/Rfam search, were selected. Among these 
5 are found to be well conserved in the same genomic 
location in other species, and 5 are located in introns. 
Single instances of U6atac, RNaseMRP, RNaseP_nuc were 
similarly found. Finally even though two 7SK were found 
to have satisfactory TATA-boxes and Proximal Sequence 
Elements (PSE), only one sequence was highly similar to 
other mammalian 7SK sequences. 

For snRNAs, 17 Ul, 14 U2, 3 U3, 6 U4, 7 U5, 2 U7, 2 
U8, 1 Ull, 1 U12 and 2 U13 were found to have PolII pro- 
moters sequences. In case of U4atac, the PSEB and PSEA 
segments were spatially inverted, i.e. PSEB was located 
upstream of PSEA. Still, due to the high conservation of 
U4atac in all other mammals, U4atac was kept in the 
annotation. 

The structure based homology search picks up 3 vault 
RNAs. Two copies on chromosome 2 where one is an 
active copy with a degraded copy nearby conserved in pig, 
cattle, and dolphin (in pig it is not even part of the low con- 
fident annotation, but found in the complete set of BLAST 
supported Inf ernal/Rfam results with an Infernal 
E-value of around 0.2). It appears to be a common thread 
amongst the species within the Laurasiatherian lineage 
that only one good copy of the vault RNA gene exists 
within the vault RNA cluster, unlike the one in human 
which feature 4 good copies (chr2 in pig, chrl4 in horse, 
chr7 in cow, scaffoldllll77 in dolphin, chrll in dog, 
chrAl in cat, and chr5 in human). A high scoring pseudo- 
gene is found on chromosome X, in pig, as well as human. 
A lower scoring vault RNA is found on chromosome X 
in cow, and possibly in dolphin, however in the latter 
the genome is not assembled on chromosomes. Note that 
these pseudogenic copies on chromosome X cannot be 
related through the genome wise pairwise alignments. 

Y RNAs are Pol-III transcripts with the canonical copies 
transcribed from a single gene cluster, see [45]. We find 
8 Y RNAs in the pig genome using structure based 
homology search. 4 are low-scoring and are scattered 
over the genome. In most mammalian genomes they are 
accompanied by a large number of retro-pseudogenes 
[46]. In the pig these numbers are relatively small (8 Y 
RNA candidates detected in pig in total with Infernal 
and the cutoffs applied for the high confident annota- 
tion, compared to 690 in human detected with the same 
cutoffs). 

The pig genome harbours a complete Y RNA cluster: 
The arrangement on pig chromosome 9 Yl(+ strand), Y3(- 
strand), Y4(- strand), Y5(- strand) conforms to the ances- 
tral eutherian prototype (with Yl inverted relative to the 
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ancestral mammalian arrangement). It is unclear, how- 
ever, whether the Y5 locus is still functional in the pig 
genome or already a pseudogenized remnant. The diver- 
gence from the human sequence is much higher than that 
of the other three pig Y RNAs, indicating a drastic reduc- 
tion of stabUizing selection: Yl 1.8%, Y3 3.0% Y4 4.3% Y5 
13.6% based on UCSC browser repeat annotation. This 
is to be compared with the situation in cow and dolphin 
were Y5 is deteriorated to a degree were it is almost cer- 
tainly non-functional. To further test this we tested the 
Y RNAs with cmalign for all seven organisms includ- 
ing pig where the pairwise alignment points to complete Y 
RNA clusters. The seven organisms are pig, cow, dolphin, 
horse, dog, human and rabbit. Alignment to the covari- 
ance model (CM) clearly shows that Yl and Y3 fits the 
model consistently in all organisms, while the Y4 and Y5 
RNAs are all somewhat divergent in the pig/cow/dolphin 
part of the tree. However, only the cow and dolphin Y5 
RNAs scores badly in the Rfam Y RNA model. The Y 
RNA alignments in Stockholm format are available on the 
website of this paper. 

This section concludes the discussion about specific 
classes of structured RNAs and the curation of these 
RNAs. We proceed with a summary of the complete 
curated annotation 

Curated annotation 

The high confident annotation is a combination of the 
results of the high confident homology pipeline and the 
miRDeep detected miRNA candidates. We obtained a 
total of 3,556 high confident loci (Table 7). Using a variety 



of methods presented above we were able to curate 13% of 
the annotated RNA loci while 37% of the annotations are 
marked as pseudogenes. Pseudogenes are either ncRNAs 
expected, but failing, to be PolII or PolIII transcripts, 
cis-regulatory elements not in the vicinity of annotated 
protein coding genes, or ribosomal RNAs not part of the 
cluster on chromosome 6. 

The curated annotation presented in Table 7 contains 
the 571 genes that passed the curation procedures pre- 
sented in the preceding sections. The same curation 
procedures marked 1,581 ncRNA genes as pseudogenes. 
The curated annotation represents a substantial manual 
annotation effort, which considered several aspects of the 
different classes of structured RNAs discussed in the pre- 
vious sections, including the context of protein coding 
genes, clustering of neighbouring RNA loci, synteny, as 
well as sequence and structural conservation of the anno- 
tation in other organisms. The curated structured RNAs 
are clearly those that comes with the highest certainty, 
however, fields such as small RNA-seq, will benefit from 
the high confident annotation, and in some cases even 
from the medium or low confident annotations. 

Discussion 

Annotating genomes for functional elements is in general 
a non-trivial problem, in particular, if this is considered 
to include full functional annotation. A recent conserva- 
tive effort to characterize proteins, CharProtDB, contains 
only 213 human entries [47], approximately one percent of 
the total number of protein coding genes. Clearly the use 
of sequence homology to screen the genome for known 



Table 7 Curated annotation 





# high confident loci 


# curated loci 


# pseudogenes 


# loci,#pseudogenes subtracted 


cisreg-elements 


139 


93 


31 


108 


IncRNA-loci 


58 


0 


0 


58 


miRNA 


369 


125 


0 


369 


putative-miRNA 


155 


0 


0 


155 


ribozyme 


8 


1 


5 


3 


rRNA 


185 


3 


182 


3 


snoRNA 


638 


269 


278 


360 


snRNA 


1,030 


69 


960 


70 


tRNA 


810 


0 


0 


810 


Other 


153 


3 


125 


28 


Conflict 


11 


8 


0 


11 


Sum 


3,556 


571 


1,581 


1,975 



The high confident annotation is a combination of the results of the high confident homology pipeline and the miRDeep results. See Table 2 for row labels. Column 
labels: high confident is the high confident annotation prior to curation; curated are the number of loci curated by methods explained in the text; pseudogenes are 
the loci expected to be Polll/Pollll transcript, but failing to be so, ribosomal RNAs not part of the cluster on chromosome 6, and cis-regulatory elements without gene 
context. In the column overlaps to structured RNA loci annotated by homology as well as putative miRNAs are given. Curated annotation contains loci that are a) 
curated or b) loci not tested in the curation procedure, e.g., miRNA loci. High confident: is the complete high confident annotation (homology + miRDeep). 8 miRNAs 
detected by miRDeep, but not by high confident BLAST where re-annotated in the section miRNAs in the p/g genome. A table with the 3,877 medium confident loci 
and 36,647 low confident loci are found In Additional file 1 : Table 520. 
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sequences is useful. However, in contrast to protein cod- 
ing genes, search for structured RNAs is a much harder 
task, partly due to the long ranging interactions in RNA 
(secondary) structure. 

Other protocols for annotating ncRNA exists, such as 
RNAspace, but it is not feasible for full genome-wide 
analysis of a mammalian genome [48] . 

Our pipeline utilized sequence and structure homol- 
ogy search, small RNA-seq and de novo RNA structure 
prediction in genomic sequence. We obtained high con- 
fident annotation of 3,556 structured RNAs distributed 
over 3,348 (full length) ncRNA genes, 11 conflict loci, 139 
cis -regulatory elements, as well as 58 IncRNA loci. The 
11 conflict loci are all identified as very good matches to 
either snoRNAs, to SRP RNAs or to the rRNA subunit 
28S. However, part of these RNA sequences have been 
identified as miRNAs in other studies and added to miR- 
Base. Deciding if the 11 conflict RNAs can also serve as 
miRNAs is outside the scope of this study. We chose to 
keep these annotations as high confident as we have no 
doubt about their annotation, only their possible func- 
tion as miRNAs are yet to be determined. Note, that 5 
of the conflicts are also annotated as conflicts between 
snoRNAs and miRNAs in human or mouse; 3 conflicts are 
found by high confident BLAST to be between miRNAs 
and seed members of the SRP RNA family; and the anno- 
tation of 28S is beyond dispute as it is part of a full rRNA 
cluster. This leaves 2 conflicts unaccounted for, which 
are the results of more recent additions to miRBase from 
organisms besides mouse and human. 

We found 524 miRNA loci from 465 different (puta- 
tive) miRNAs precursor sequences, of which 174 were 
annotated for pig already. 151 are so far annotated only 
in pig, however, only 5 appear to be confined to the 
Laurasiatheria, 4 found only in pig and one specific to the 
cow/pig/dolphin part of the tree. 

The sequence based part of the pipeline complete the 
highly conservative part of the pipeline, by employing 
BLAST cutoffs of 0.1 in E-value, 95% identity with a 
coverage of at least 95% coverage of the RNA query 
sequence. Correspondingly, for the Infernal screen we 
employed only the family specific gathering score cutoffs 
and ran Infernal directly on the genomic sequence, 
which is in contrast to the official Rfam annotation 
where BLAST pre-filtering is performed. Our Infernal 
screened yield 709,026 loci matches, and we found that 
BLAST filtering as well as additional Infernal E-value 
filtering were needed to obtain confident results. Our 
results are not directly comparable to those of Rfam, 
because the exact filtering procedure used by Rfam 
is unpublished and thereby not directly available for 
assessment. 

The high confident annotation is the result of an 
automated pipeline, while the manual annotation is 



presented in the form of the curated genes. In particu- 
lar the results of the Infernal and snoStrip results 
will contain pseudogenes and the Infernal results 
may contain false positives as well. We therefore ran all 
tools on a local (120 nt window) shuffled version (see 
Methods) of the pig genome to obtain an estimation of 
the false positives on random data. We found that most 
tools, even at their lowest confidence levels produced no 
results on the shuffled sequence. This includes, RNAmmer, 
tRNAscan-SE, snoStrip, and BLAST (with the excep- 
tion of 6 very short CRISPR sequences not expected in 
vertebrate genomes). The exception is Infernal, which 
produces a high number of hits on the shuffled sequence 
without filtering in addition to what is offered by the fam- 
ily specific gathering score cutoffs. However, by introduc- 
ing a global Infernal E-value cutoff of le-3 in addition 
to the gathering score cutoff, we were able to cut the 
number of false positives in the shuffled data down to 
just one low confident hit and no high confident hits. A 
second potential problem using Infernal is the pos- 
sibility for incorrect family assignment {e.g., a snoRNA 
or miRNA assigned with a wrong family name). How- 
ever, with zero conflicts in the high confident Infernal 
results on the unshuffled data and only 7 within the low 
confident results, this appears not to be a problem with 
Infernal used as we do here. 

This work represents a substantial computational effort. 
While most of the annotation tools, and the post pro- 
cessing could be run for the whole genome on a fast 
stand-alone workstation, two major parts of the pipeline 
precludes this. Firstly, running Infernal for all Rfam 
models takes around 25 CPU years. Secondly, the pair- 
wise alignments take in the order of 1 CPU year per 
genome we aligned against the pig genome. Infernal 
has recently achieved a substantial speed up by taking 
full advantage of the recent developments in the HMM'er 
code [49]. The improved HMM'er filtering will hopefully 
replace the need for BLAST filtering of the Infernal 
results since a much better false discovery rate is reported 
in the cited paper. However, we will have to wait for 
a later version of the Rfam database as even the lat- 
est version (11.0) is constructed with Infernal version 
1.0.2. With the improved Infernal code the annotation 
part of the pipeline could be run on a normal work- 
station within a few weeks. However, the pairwise and 
multiple alignments, and thereby the synteny analysis 
will still require access to a high-performance computing 
cluster. 

To further ensure confidence, we manually or semi- 
automatically inspected a selected subset of the 3,556 
annotated RNA loci to reach a total of 571 curated struc- 
tured RNA and 1,581 pseudogenes. One class of genes 
inspected was Y RNAs for which the inspection gave rise 
to the observation that the Y5 loci, in contrast to the 
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others, already might have mutated into a non-functional 
state. 

Ensembl version 68 provides ncRNA annotation for 

the pig genome as part of their annotation pipeline. 
Ensembl version 68 contains 2,965 structured RNA loci 
for the pig genome divided on rRNAs, miRNAs, misc- 
RNAs, snoRNAs and snRNAs. Of the 3,418 high confi- 
dent annotations obtained by our homology search only 
1,955 are present in the Ensembl annotation. In par- 
ticular all tRNAs are missing entirely or are annotated 
as miRNAs, though the basis for annotation is unclear 
to us. Likewise the cis-regulatory elements are miss- 
ing completely from the Ensembl annotation. In the 
other end of the spectrum, we find that 522 of the 
2,965 Ensembl annotations are not found in the low 
confident annotation. The ncRNAs found extra in the 
Ensembl annotation are in most cases miRNAs without 
an assigned gene name, indicating that they are proba- 
bly low confident miRNAs from Rfam. A comparison of 
the two annotations is found in see Additional file 1: 
Table S15. 

To further search for de novo RNA structures we 
employed RNAz [21]. RNAz uses a sliding window based 
approach which chop out sections (typically of 120 nt) of 
multiple (sequence) alignments and measure the struc- 
ture potential over a background in that window. A main 
concern is the input MultiZ MAE blocks, which are 
typically used as outset. The more organisms the MAF 
blocks contain, the more fragmented they become and 
the input space is not covered optimally. Another con- 
cern is the heterogeneous distribution of the evolutionary 
distances between the organisms in the alignments. Ide- 
ally RNAz should be fed with equally distanced organisms. 
The RNAz search provided 83,869 structurally conserved 
RNA loci of which 528 overlapped with the high confident 
homology based annotation or novel miRNAs. The class 
best covered by RNAz loci were the miRNAs, where 65% 
of the loci were recovered. Most RNAz loci were found in 
the intergenic or intronic regions, however, around 10% 
were found near the beginning or end of protein coding 
transcripts. 

A second source of novel candidate RNAs in pig 
comes from the small RNA-seq data presented in 
this work. Here, the main analysis of the sequenc- 
ing data is performed with miRDeep and we therefore 
only annotate novel miRNA candidates. However, with 
deepBlockAl ign (DBA) we also found novel RNA can- 
didates in other classes of RNAs, but without more com- 
putational support or experimental evidence these are not 
annotated with high confidence and are therefore only 
presented as a special track. For miRNAs, the RNA-seq 
profiles are particularly specific and the mi r DBA tool [35] 
for aligning miRNA RNA-seq profiles directly against a 
compiled database of RNA-seq profQes was developed in 



parallel with this work. Future updates will use the full 
potential of mirDBA. 

Early versions of the pig genome have been investi- 
gated for ncRNAs, these studies investigated a low (0.66x) 
coverage version of the genome [11] in which 51 miR- 
NAs were identified and a later study investigated the pig 
ESTs [13] in which 8 additional miRNAs not found in the 
previous study were identified. We confirm all these 59 
miRNAs in this paper, except for 3 where the sequence is 
missing in the current assembly of the pig genome. The 
study of the low coverage version of the genome did not 
include a search for other types of ncRNAs, while the 
ESTs from the second study were selected with a bias 
against small ncRNAs and a comparison of the obtained 
full ncRNA annotations with the present is therefore not 
entirely meaningful. 

Here, we focused on annotating structured RNAs (ncR- 
NAs and RNA structural regulatory elements in UTRs of 
mRNAs) and we did not include IncRNAs with the excep- 
tion of partial IncRNAs included in Rfam. LncRNA in 
contrast to the many structured RNAs are just emerging 
and systematically characterized by genomic coordinates, 
e.g., [50]. These IncRNAs have been obtained through 
comprehensive transcription studies. Given that IncRNAs 
in general are poorly conserved and similar studies have 
not been carried out in pig, we in this study focused on 
structured RNA (of which some de novo RNA structure 
candidates may overlap IncRNAs). 

Overall our ncRNA annotation is of further interest in 
the light of pig as production and model animal. The 
known ncRNAs holds potential to be studied in specific 
problems, where for example mapped RNAs can be use- 
ful in subsequent studies, such as differential expression 
of ncRNAs, e.g., in necrotic lung tissue [51] and skele- 
tal muscle tissue [52]. Furthermore, as next generation 
sequencing will make it possible to increase the quality 
of Quantitative Trait Loci (QTL) and genome-wide asso- 
ciation studies, these regions can be analyzed for ncRNA 
content. For example we considered all QTL contribut- 
ing regions less than 500,000 nt listed in the pigQTLdb 
[53] and by Fischer Exact test we found 55 QTL regions 
to be enriched (p<0.01) with RNAz candidates (RNAz p- 
score>0.9) (data not shown). A similar, but weaker trend 
is observed for a few cases when testing the enrichment 
for only the annotated families and classes of structured 
RNAs. 

As an example of what one can do with an ncRNA anno- 
tation in relation to a larger set of genes, we analyzed the 
46 regions of the pig genome containing olfactory receptor 
genes [54]. The regions overlapped with 249 of our high 
confident ncRNA candidates, including, 170 tRNAs, 8 
known miRNAs, 1 miRNA predicted by miRDeep, 25 
snoRNAs and 7 cis-regulatory elements. In particular the 
miRNAs and cis-regulatory elements could be involved in 
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regulation of the olfactory receptor genes. The genome 
structure can be relevant for further studies aiming at 
mutually to elucidate the functionality as well as the mech- 
anisms underlying these traits. 

Conclusions 

We have conducted the, to date, probably most compre- 
hensive ncRNA annotation of a mammalian genome in a 
single analysis, and we have obtained an extensive anno- 
tation of structured RNAs in Sus scrofa. We constructed 
a semi-automated pipeline enabling manual curation at 
reasonable level while also incorporating RNA-seq data 
and including a layer of de novo RNA structures. Over- 
all we annotated 3,556 high confident structured RNAs 
of which a subset was inspected leading to 571 curated 
annotations. In addition we found 83,869 structurally con- 
served loci using RNAz, of which 528 overlapped with 
the homology based annotation or novel miRNA can- 
didates. These structured RNAs constitute an essential 
resource for studying and generating hypothesis of genetic 
mechanisms underlying the various production traits. All 
the data resources are available at http://rth.dk/resources/ 
rnannotator/susscrl02/versionl.02. 

Methods 

Example command lines and program options for call- 
ing of the different tools are given in Additional file 
1: Table S16 for the annotation pipeline in general. 
Additional file 1: Table S17 for the Infernal runs, and 
Additional file 1: Table SIS for the alignment pipeline. 

Sequence based homology search 

The sequence based homology search was performed 
using blastall (blastn) from the BLAST [23] tool 
version 2.2.23 and a number of databases (miRNA hair- 
pins from miRBase (version 18.0) [55-58], the vertebrate 
rRNAs from Silva (version 102) [59], snoRNAs from 
snoRNAbase (version 3) [60], tRNAs from tRNAdb (ver- 
sion 2009) [61], and seed sequences from Rfam (version 
10.1) [7,8]). The databases were not redundancy reduced 
prior to the search and there may therefore be overlap- 
ping hits on a given genomic position. Redundant overlaps 
were resolved as follows: Within each database the hits 
were classified as miRNAs, tRNAs, snoRNAs, snRNAs, 
rRNAs, cis- regulatory elements, or as other for those 
without a clear classification while still keeping track of 
the origin. Within the classified hits of each database the 
hit having the lowest BLAST E-value was chosen. For 
example, SNORA81 resides on chromosome 13 in pig. 
Two overlapping high confident BLAST hits from the 
Rfam version 10.1 seed sequences were found. One from 
small brown bat, and one as it turns out from pig. Not sur- 
prisingly, the hit having the lowest E-value is the sequence 
from pig. At the same genomic position and strand, a 



high confident hit from miRBase is found, eca-mir-1248 
is found. The sequence was thus annotated twice, once as 
a snoRNA and once as a miRNA. Therefore this region 
will be marked as a conflict. For the high confident results, 
we required that a minimum of 95% align length and 
95% identity of the BLAST hit, For the medium confident 
results, we required 92.5% length and 92.5% identity, and 
for the low confident results 90% length and 90% iden- 
tity. All BLAST searches were performed with an E-value 
cutoff of 0.1. 

Structure based homology 

The structure based homology search was based on the 
Infernal 1.02 [24] models from Rfam 10.1 [8]. A scan of 
the complete pig genome without any BLAST pre-filtering 
and with all models was performed. The Infernal 
options for a default search is given in the CM model files 
provided by Rfam, and depends on the model with the 
following parameters as variants: -g (global/local(default) 
alignment), -T (score cut off, set at the so-called gather- 
ing score cutoff for the given model), and -fil-no-hmm 
(turns of HMM pre-filtering, speeding up the search at the 
cost of sensitivity). The Infernal options were used as 
specified in the CM model file, except for three models 
Intron gpl, RNase_MRP, and SSU_rRNA_bacteria where 
HMM pre-filtering was enforced because of the excessive 
time consumption posed by these models. 

The Rfam families are generally constructed with sensi- 
tivity in mind. Each family comes with a gathering score 
cutoff, which is the recommended cutoff for scanning for 
new members of the family. This cutoff will by the very 
construction be too sensitive, but it is the only alternative 
that Rfam provides. 

Initially the scan was performed with the models of 
Rfam version 10.0 against a pre-release of the pig genome 
version 10. Models added or updated in Rfam version 10.1 
were later applied to the full genome (pig genome ver- 
sion 10.2). The unchanged models from Rfam version 10.0 
were applied to new or altered sequence between the pre- 
release of the pig genome and the final version of the 
genome otherwise used in this paper. Because of this, the 
exact run time for the pig genome is unknown, however, 
a corresponding scan of the human genome took 25 CPU 
years. The latest version of Infernal is claimed [49] to 
be a 100 times faster than the version used in this study, 
which would put the entire annotation pipeline within 
reach of a small computational cluster. 

The raw output of the Infernal scan of the Rfam fam- 
ilies on pig yielded a total of 709,026 loci, where 8% of 
these had conflicting Rfam annotation. Additional filter- 
ing is obviously necessary even at the lowest of confidence 
levels. The first thing one should notice is that the con- 
struction of the families and setting of the cutoffs by 
the Rfam consortium includes a BLAST filtering step, 
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and the exact procedure of which is unpublished. This 
vastly enhances the selectivity of the families and allows 
much lower Infernal score cutoffs than maybe reason- 
able if the families are used without BLAST pre-filtering, 
which for the pig genome results in the 709,026 matches 
reported above. 

We therefore chose to apply 4 additional filters as 
explained below, 1. using only families for which we 
expect a hit {i.e. only families with seeds in vertebrates), 
2. Infernal E-value filtering for families with very loose 
E-value cutoffs implied by the gathering score, 3. a BLAST 
filter, 4. special filtering for the miRNA families. 

Filter 1. Not all families are known for vertebrate 
genomes. We limit the analysis to the 695 families with 
seed sequences in vertebrates. 

Filter 2. We normally used the family's gathering score 
as cutoff However, in some cases this cutoff was very 
loose when translated to Infernal E-values. The score 
cutoff on an Infernal CM model implies an equiva- 
lent Infernal E-value cutoff The relationship between 
the two is found by calibration of each family. In some 
cases we find the model gathering score cutoff to imply 
Infernal E-value cutoffs up to around le6 (the extreme 
case is S_pombe_snR97). For non-miRNA families we 
therefore impose an additional Infernal E-value cutoff 
of le-3 in the cases where the E-value cutoff implied by 
the gathering score cutoff is above le-3. The cutoff was 
chosen because with stricter E-value cutoffs we quickly 
start to remove valid candidates from tRNA and snoRNA 
families. 

To further justify the global E-value cutoff, we ran the 
514 Infernal models with (low) confident hits in pig 
on shuffled sequence (see the Subsection False positive 
rate of the homology search pipeline for details on the 
shuffling procedure). 208 or about 40% of the families pro- 
duced hits in the shuffled sequence at their family specific 
score cutoffs. The worst being mir-684 with 3,212 hits 
and 5S_rRNA with 719 hits. When we reduce the global 
Infernal E-value cutoff, we gradually reduce the num- 
ber of families that have matches in the shuffled sequence 
to 161 at an Infernal E-value cutoff of 1 where the 
worst family is Histone-3-prime-UTR with 47 hits. Down 
to only two hits in two different families at a cutoff of 
le-3 (See the graph in the Additional file 1: Figure S21). 
This means that we only observe two false positives in the 
shuffled sequence at the low confident cutoff prior to the 
BLAST filtering. At our the cutoffs used for the low con- 
fidence annotation a single false positive is found, while 
none are found at the more strict cutoff levels. 

Filter 3. The specificity and sensitivity of a family is 
highly dependent on the BLAST filtering, and using the 
models score cutoff alone will often be far too loose. 
However, even a very loose BLAST filter was suffi- 
cient to greatly enhance specificity without affecting the 



sensitivity of most models. That is the sequence based 
homology filter anchored the families to the right genomic 
locations and greatly reduce the number of hits. For 
each family we applied a BLAST filter based the family's 
seed sequences. We tested the implications of a com- 
mon BLAST E-value cutoff for all non-miRNA families. 
By gradually relaxing (increasing) the cutoff, we found 
that conflict of annotation occurred at an E-value of le- 
1 (two different Rfam families annotated the same locus). 
By gradually decreasing the cutoff we found that with a 
BLAST cutoff value of le-5 we loose valid annotations. 
The worst case is the Histone-3-prime-UTR family, where 
17 of 21 matches are lost and since these 17 matches are 
part of three clusters of Histone-3-prime-UTR annota- 
tions, we believe them to be valid family members. We 
therefore settled on an intermediate BLAST E-value cutoff 
of le-3 for all non-miRNA families, which by observation 
corresponds to an exact sequence match of length 24. A 
100% identical sequence match of this length was con- 
sidered sufficient to filter the structure matches down to 
those matching the correct Rfam family by being in the 
order of the shorter sequence motifs, e.g., the length of a 
mature miRNA. The BLAST cutoff reduced the number of 
loci with conflicting families to zero. For the medium con- 
fident results we filtered with an E-value of le-1, which is 
the limit of the BLAST run we did above. And for the low 
confident results, the BLAST filter is turned of completely. 
Note that, the HMM filtering that some families provide 
is in a sense a refinement of the BLAST filtering. 

Filter 4. We found the miRNA families to be particularly 
troublesome Rfam families. This may not be too surpris- 
ing given their simple hairpin loop structure, which is 
ubiquitous in the mammalian genomes. In any case, we 
exempt the Rfam miRNA families from the high confi- 
dent results, both for the structure similarity search and 
for sequence similarity search. We do, however, include 
the Rfam miRNA families at the medium and low confi- 
dent annotation levels. See also the subsection RNAs of 
the individual annotation tools for a specific example of a 
problematic mir-family. 

While, we apply a maximum Infernal E-value cutoff 
of le-3 for most families, we can use more strict cutoffs 
of le-6 or le-9, for low and medium confident annotation 
of Rfam miRNA families since we observed that the seed 
sequences of the miRNA families in general have much 
lower Infernal E-values. However, with Infernal E- 
value cutoffs lower than le-9 we loose valid miRNA family 
members observed with high confident BLAST against 
miRBase. We furthermore require a BLAST cutoff at all 
confidence levels. Without this requirement, we found 
that about 40% of the miRNA loci would contain conflicts 
even after imposing the extra Infernal E-value filters. 

In summary, annotating full vertebrate genomes with 
Rfam and Infernal is a difficult task, albeit without 



Anthon etal. BMC Genomics 2014, 15:459 
http://www.biomedcentral.eom/1 471-21 64/1 5/459 



Page 23 of 27 



Infernal it is hard to annotate structured RNAs at all (a 
specific example is discussed in the Results section under 
snoRNAs). The automatic annotation that we present in 
here is, even with the extra filters, is likely to contain a 
few pseudogenes or in worst case uncertain annotations. 
Thus, the curated genes always contain the annotation 
with the highest certainty. 

Class specific tools 

RNAmmer [25] version 1.2 tool were used for rRNA 
detection and the results were used as is. In this mode the 
tool detects a number of 8S, 18S and 28S fragments which 
will need hand curation. 

tRNAscan-SE version 1.23 [10] was used to search for 
tRNAs. For mammalian genomes it is often necessary to 
apply additional filtering of the tRNAscan-SE results. 
For the high and medium confident results we apply the 
filtering used for the cow genome in the database on the 
tRNAscan-SE website: The high confident results we 
report are filtered by HMM Score 10.0, 2'Str Score 5.0, 
and COVE Score 20 when the tRNA is found by both first 
time parsers and by COVE Score above 55 when the tRNA 
is only found by one. The low confident results are the 
unfQtered results of the tRNAscan-SE. 

snoStrip [26] in a pre-release version was used as 
a specialized tool for identifying snoRNAs from incom- 
plete BLAST hits. The BLAST search for snoRNAs was 
done using all snoRNAs from an internally curated 
snoRNA database. BLAST-hits were accepted as homolo- 
gous snoRNAs with a sequence identity higher than 85% 
and a minimal length of at least 90% of the given query, 
the £-value cutoff was le-10. Iteratively, we searched 
with this method as long as we were able to find new 
snoRNA homologs. In order to further remove pseudo- 
genes, snoRNA-boxes were scored based on PWM gen- 
erated with the curated snoRNA sequences contained in 
the database. The false positive rate of the snoStrip tool 
was gauged by running the release version on a shuffled 
version of the pig genome. No results were found in the 
shuffled sequence when running snoStrip at the same 
cutoffs as those applied on the real pig genome. 

Detection of active Pollll and Polll transcribed RNAs 

PolIII transcribed RNAs like U6 may behave like repeti- 
tive elements, making it difficult to know which annotated 
element undertakes the expected cellular function. 

Based on the assumption that only transcribed elements 
should actually be functional, we devised a method based 
on the detection of the Proximal Sequence Element (PSE) 
and TATA-Box upstream of the annotated transcript, two 
sequence elements that are essential for the successful 
transcription of PolIII elements (see Additional file 1: 
Figure S22). A slice of 100 nts directly upstream of the 
annotated element of curated human PolIII transcripts 



(7SK, U6, U6atac, Y RNA, RNaseP RNaseMRP), was taken 
and searched for PSE and TATA (See Additional file 1: 
Table S21) 

A strong TATA box signal was observed at the 3 end of 
the upstream region of the full set of high confident U6 
snRNAs for the preliminary annotation when compared 
to a set of 1,000 random sequences(see Additional file 1: 
Table S23, Table S22 and Figure S23). 

Similar to PolIII transcripts, PolII transcripts also pos- 
sess promoter sequences upstream of the transcript. 
Those are called Proximal Sequence Element A, PSEA, 
(-50 nts upstream) and Proximal Sequence Element B, 
PSEB, (-25 nts upstream) (see Additional file 1: Table S24 
and Figure S24). 

It should be noted that PSEA with high scores on 
the set of raw PolII transcripts is significantly different 
from the PSEA distribution of the random sequences (see 
Additional file 1: Table S25, Table S26 and Figure S23). In 
the case of the putative PolII transcripts, the high-scoring 
PSEA elements preferentially locate 50 nts upstream of 
the transcript start. 

Based on these PWMs the reported PolIII (U6, U6atac, 
7SK, RNaseP_nuc, RNaseMRP) transcripts were filtered. 
Finally, alignments of the selected sequences were gener- 
ated either with clustalw [62] and/or cmalign [24] 
in order to better spot sequences that might not be 
functional. 

Genome alignments 

The pairwise alignments were performed using LASTZ 
version 1.02, and the alignments were subsequently 
chained using Kent's axt Chain, which is part of the 
UCSC tool chain with code base dated September 2011 
[30]. 

The alignment is performed with LASTZ [29] on smaller 
chunks of the genomes involved. The target genome(pig) 
is cut down to fragments of at most 80,010,000 with an 
overlap between chunks of 10,000 nucleotides. The source 
genome (see Table 4 for at list of source genomes) is cut 
down to fragments of at most 80,000,000. The parameters 
used for the LASTZ alignments are divided into closely 
and distantly related genomes. See Table 4 for which 
genomes implies which parameters and Additional file 1: 
Table S19 for the exact options used. 

The LASTZ alignments are chained with the axtChain 
tool and merged into one big chain file containing all 
the chained alignments(chains). These chains may overlap 
due to duplication of sequence in either target or source. 
The chained alignments were converted to single coverage 
alignments(nets) of the pig genome using chainNet. The 
netted alignments, which are only single coverage of the 
pig genome not the source genomes, were cleaned for 
non-syntenic alignments with net Filter or by forming 
reciprocal best nets, in which only the best alignment is 



Anthon ef al. BMC Genomics 20 1 4, 1 5:459 
http://www.biomedcentral.eom/1 471-21 64/1 5/459 



Page 24 of 27 



kept in case of multiple coverage of the source genome. 
(axtChain, chainNet and netFilter are all part of 
the UCSC tool chain [30]). 

The pairwise alignments were joined to a multi- 
ple alignment with the roast program from the 
Threaded-Blockset -Aligner (TBA/MultiZ) [36]. 
Default parameters were used for the program, i.e., a 
dynamic programming range of 30 and a minimum MAF 
block size of 1. The pairwise alignments were cleaned for 
synteny prior to running roast; either by using the syn- 
tenic nets or by using the reciprocal best nets in the cases 
of many unplaced contigs in the source genome, a scaf- 
fold based genome, or low coverage of the pig genome. 
See Table 4 for the nets used. A phylogenetic tree without 
branch lengths is required as part of the input to roast, 
see Figure 3. 

The UCSC tool chain allows an automatic cleanup of 
the best single coverage alignments of the pig against 
other genomes in which alignments of gap-filling char- 
acter has been deleted. These so-called syntenic nets are 
ideal for identification of synteny. The synteny blocks 
from the syntenic nets were identified by merging neigh- 
bouring alignments with identical source chromosome, 
target chromosome and strand information. The align- 
ments were only merged when the gap between them were 
smaller than 5,000 nt. Only merged alignments at least 
5,000 nt long were used in the further analysis. 

RNAz 

RNAz version 2.1 [21] was used for de novo prediction. 
RNAz identifies loci of conserved structure based upon a 
multiple alignment. RNAz ran on windows of the MAF 
blocks from the multiple alignment. Windows were of size 
120, step size 40. A maximum of 6 sequences were allowed 
in each window. In case of more than 6 sequences in a 
window, the alignments were optimized for an average 
sequence identity of 80%. RNAz was used with the di- 
nucleotide option and with strand identification. At least 
three sequences were required to make a prediction. The 
results are filtered according to a RNAz p-score 0.9. The p- 
score is a reliability score for the RNA prediction and not 
a p-value. RNAz is known to have a high false positive rate 
[21] and we therefore ran RNAz on MAF windows shuf- 
fled using multiperm [63]. We found false positive rates 
of 61% and 52% with p-score cutoffs of 0.5 and 0.9, respec- 
tively (with the false positive rates defined as the number 
of shuffled windows with RNA structures divided by the 
corresponding number on the unshuffled windows). In 
accordance with the original paper where the respective 
false positive rates where 59% and 54% [21]. 

False positive rate of the homology search pipeline 

To gauge the false positive rate of the pipeline, we cre- 
ated a shuffled version of the pig genome and ran the 



prediction tools with the original settings on the pig 
genome. 

The approximate size of the typical structured RNAs 

that we find is around 120 nt. Accordingly, the contigs 
of the pig genome were shuffled in 120 nt windows with 
uShuf f le [64] set to preserve the di-nucleotide con- 
tents. The shuffled windows were then put together in the 
corresponding order in the contigs to create a shuffled 
pig-genome with the exact same (di-)nucleotide contents, 
local GC content bias and contig size as the unshuffled 
one. The false positive rate of the tools were then found as 
the number of hits obtained in the shuffled genome. 

Preparation of small RNA libraries 

Total RNA was isolated from liver, lung, kidney, colon, 
small intestine, spleen, lymph node, cerebellum, frontal 
lobe, and placenta of Pinky, a genetically identical clone 
of TJ Tabasco, using mirVana miRNA'" Isolation Kit 
(Ambion). Ten micrograms of RNA from each tissue 
was separated on 15% polyacrylamide Tris-Borate-EDTA- 
Urea gels (Bio-Rad), stained with SYBR' Safe DNA gel 
stain (Invitrogen) and visualized using a Dark Reader 
Transilluminator (Clare Chemical Research). Small RNA 
in the 15-30 nt range was excised, eluted from the gel slice 
in 0.3 M NaCl, precipitated in ethanol, and finally dis- 
solved in DEPC-treated water. Next, small RNA libraries 
were prepared for next-generation sequencing using the 
AIR™ Small RNA Sequencing Kit (Bioo Scientific Corpo- 
ration) in combination with AIR™ Barcoded Adenylated 
Adaptors (Bioo Scientific Corporation) to enable multi- 
plex sequencing. The 5' adaptor and 3' adaptors were 
sequentially ligated to the small RNA and the ligation 
products were gel purified between each ligation step 
on 10% Tris-Borate-EDTA-Urea polyacrylamide gels. The 
gel-purified ligation products were reverse transcribed 
and amplified by 20 cycles of PCR. The resulting 100 
bp PCR fragments were separated on 4% high resolution 
agarose gels (Metaphor Agarose, Lonza), purified using a 
QIAquick gel extraction kit (QIAGEN), eluted in 30 /xl 
lOmM Tris-HCl pH 8.5 (EB buffer) and dried down to a 
volume of 15 /xL using a SpeedVac. The size distributions, 
qualities and quantities of the libraries were measured 
using a NanoDrop ND-1000 Spectrophotometer and an 
Agilent 2100 Bioanalyzer. Finally, the concentrations were 
adjusted to 10 nM using EB buffer. 

Illumina Genome Analyzer Sequencing and primary 
sequence analysis 

The ten barcoded libraries were cluster amplified using an 
Illumina Cluster Station followed by 50 cycles of sequenc- 
ing on an Illumina Genome Analyzer IIx Sequencer and 
primary data analysis using the Illumina Pipeline software 
(CASAVA 1.7). The sequence reads were first processed 
by demultiplexing and trimming of adaptor sequences, 
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using only reads with full length 3' adaptor sequences. 
Sequence reads shorter than 15 nt were discarded, leaving 
approximately 40M reads for further analysis. Identical 
reads were collapsed into unique reads that were quan- 
tified (558K), and subsequently aligned to build 10.2 of 
the Sus scrofa reference genome using Megablast (-W 12, 
-p 100). Sequences mapping to more than five genomic 
loci were discarded, and only sequences with perfect, 
full-length alignment (100% identity) to the genome were 
retained for evaluation, yielding 210K collapsed reads that 
represent 27. 7M reads. 

miRDeep analysis of the small RNA data 

Annotation of the reference pig genome with known 
and candidate miRNA genes was accomplished with 
miRDeep version 1 [28,65-68]. The sequences of the 
miRNAs will be submitted to miRBase for annotation. 

The raw output from miRDeep indicated 467 putative 
miRNA loci with at least 10 reads present. However, due 
to duplication of genomic sequence in assembled chro- 
mosomes as well as unplaced scaffolds only 421 different 
miRNA precursor sequences, and 381 mature sequences 
were found. (Note: The 421 miRNA precursors were 
given provisional names, pre-1 ...pre-421, pending cor- 
rect names upon addition to miRBase). 

Exact mapping of these 421 sequences resulted in 459 
loci, 7 of them ambiguous because of overlapping pre- 
cursor sequences. In 5 of the seven cases the ambiguity 
could be resolved by noting that both star and mature 
sequences were identical between the conflicting precur- 
sor sequences. In the last 2 cases the star reads were 
missing and both precursors discarded. In one particular 
instance a precursor was reported with to different mature 
sequences, one two nucleotides longer than the other. The 
shortest mature sequence was discarded in this instance. 

The 459 loci were compared to the high confident anno- 
tation obtained in Table 2 and were found to overlap 
with snoRNAs, U5 or 28S in a total of 9 instances and 
the corresponding precursor sequences were discarded. 
Loci overlapping with conflicts already marked in the 
homology search based results were kept. 

In 7 instances of novel miRNAs, we found that the pre- 
cursor sequences overlapped with protein coding annota- 
tion, which led to the move of 6 precursor sequences to 
the medium confident annotation. 

All mature sequences were mapped against all precursor 
sequences to identify misplaced reads. That is, reads map- 
ping to multiple places in the genome or reads incorrectly 
placed because of mapping issues. Careful examination of 
the mapping results lead to the deletion of a further 10 
precursor sequences. 

9 sequences were deleted because they would intro- 
duce conflicts of annotation, and 8 sequences were deleted 
due to overlapping pre-miRNA coordinates. 10 sequences 



were deleted because they originated from misplaced 
reads. Finally 6 sequences were moved to the medium 
confident annotation due to overlap with protein coding 
annotation. Resulting in a final 388 precursor sequences, 
of which 232 where found to have both star and mature 
reads. The 388 precursor sequences mapped to a total of 
431 loci. All details are found in Additional file 2. 

deepBlockAIIgn 

The mapped reads from the small RNA library were 
grouped into block groups using blockbuster version 
1.0 (with parameters: -distance 30, -minBlockHeight 2, 
-minClusterHeight 10, -scale 0.5 -blockHeight abs) [33]. 
The block groups were aligned using deepBlockAlign 
version 1.0, which aligns the relative read expressions and 
arrangements of reads of the two block groups. The align- 
ment score from deepBlockAlign lies between 0 that 
suggests perfect dissimilarity and 1 that suggests perfect 
similarity between the two block groups. We require that 
the length of block groups were at least 50 to make sure 
that structure could be obtained from the read profile and 
for example singular peaks of stacked reads could be fil- 
tered out). deepBlockAlign scores are only considered 
meaning when >= 0.6. This cutoff is based on an empir- 
ical study of two biological replicates, where 95% of the 
read profiles had an alignment score >= 0.6 between the 
two samples. 

Note added in proof 

During completion of this paper a release version of the 
program snoStrip was published. We have employed 
this updated version, which resulted in minor changes for 
the annotated snoRNAs. Thus, the number of high confi- 
dent annotated snoRNAs is now changed from 638 to 621, 
in which the subset of curated snoRNAs now consist of 
289 compared to the previous 269. Overall the total num- 
ber of high confident annotated RNAs has correspond- 
ingly changed from 3,556 to 3,539. These changes are 
now included in a further release, 1.03 (see http://rth.dk/ 
resources/rnannotator/susscrl02/versionl.03/ where we 
describe the changes to the previous version (1.02) pre- 
sented in the paper). 



Additional files 



Additional file 1 : Comprises the Supplementary Tables S1 -S26 and 
Supplementary Figures SI -S24. Additional data and tracks for genome 
browser visualisation also available on http://rth.dk/resources/rnannotator/ 
susscrl02/versionl .02. 

Additional file 2: Spreadsheet containing data for the miRDeep 
detection of miRNAs. 
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